Consider a two-class classification problem when the number of
features is much larger than the sample size. The features are masked
by Gaussian noise with zero means and a covariance matrix $\Sigma$, where
the precision matrix $\Omega = \Sigma^{-1}$ is unknown but is presumably sparse.
The useful features, also unknown, are sparse and each contributes
weakly (i.e., rare and weak) to the classification decision.

By obtaining a reasonably good estimate of $\Omega$, we formulate the
setting as a linear regression model. We propose a two-stage
classification method where we first select features by the method of
Innovated Thresholding (IT), and then use the retained features and
Fisher's LDA for classification. In this approach, a crucial problem is
how to set the threshold of IT. We approach this problem by adapting
the recent innovation of Higher Criticism Thresholding (HCT).

We find that when useful features are rare and weak, the limiting
behavior of HCT is essentially just as good as the limiting behavior
of the ideal threshold, the threshold one would choose if the underlying
distribution of the signals were known (if only!). Somewhat surprisingly,
when $\Omega$ is sufficiently sparse, its off-diagonal coordinates usually do
not have a major influence over the classification decision.

Compared to recent work in the case where $\Omega$ is the identity
matrix [15, 16], the current setting is much more general, and it needs a
new approach and a much more sophisticated analysis. One key
component of the analysis is the intimate relationship between HCT
and Fisher's separation. Another key component is the tight
large-deviation bounds for empirical processes associated with data with
sparse but unconventional correlation structure, where the
separability of sparse graphs plays an important role.
1. Introduction. Consider a two-class classification problem, where
we have $n$ labeled training samples $(X_i, Y_i)$, $1 \le i \le n$. Here, $X_i$ are
$p$-dimensional feature vectors and $Y_i \in \{-1, 1\}$ are the corresponding class
labels. For simplicity, we assume two classes are equally likely, and the data
are centered so that

(1.1) $X_i \sim N(Y_i \cdot \mu, \Sigma_{p,p})$,

where $\mu$ is the contrast mean vector between two classes, and $\Sigma_{p,p}$ is the
$p \times p$ covariance matrix. Given a fresh feature vector

(1.2) $X \sim N(Y \cdot \mu, \Sigma_{p,p})$,

the goal is to use the training data $(X_i, Y_i)$ to decide whether $Y = -1$ or
$Y = 1$. We denote $\Sigma_{p,p}^{-1}$ by $\Omega_{p,p}$, and whenever there is no confusion,
we drop the subscripts (and also that of any estimator of them, say, $\hat\Omega_{p,p}$).
We are primarily interested in the so-called '$p \gg n$' regime. In many
applications where $p \gg n$ (e.g., genomics), we observe the following aspects.

• Signals are rare. Due to large $p$, the useful features (i.e., the nonzero
coordinates of $\mu$) are rare. For example, for a given type of cancer or
disease, there are usually only a small number of relevant features (i.e., 
genes or proteins). When we measure increasingly more features, we 
tend to include increasingly more irrelevant ones. 

• Signals are individually weak. The training data can be summarized
by the z-vector

(1.3) $Z = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Y_i X_i \sim N(\sqrt{n}\mu, \Sigma)$.

Due to the small $n$, signals are weak in that, individually, the nonzero
coordinates of $\sqrt{n}\mu$ are small or moderately large at most.

• Precision matrix is sparse. Take the Genetic Regulatory Network (GRN)
for example. The feature vector $X = (X(1), \ldots, X(p))'$ represents the
expression level of $p$ different genes, and is approximately distributed
as $N(\mu, \Sigma)$. For any $1 \le i \le p$, it is believed that for all except a few
$j$, $1 \le j \le p$, the gene pair $(i, j)$ are conditionally independent given
all other genes. In other words, each row of $\Omega$ has only a few nonzero
entries and so $\Omega$ is sparse [12].
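The z-vector in (1.3) is a simple summary of the training data. As a minimal sketch (the function name `z_vector` and the toy data are hypothetical, chosen only for illustration), it can be computed as follows:

```python
import math

def z_vector(X, Y):
    """Training z-vector Z = n^{-1/2} * sum_i Y_i X_i, as in (1.3).

    X is a list of n feature vectors of length p; Y is a list of n labels in {-1, +1}.
    """
    n, p = len(X), len(X[0])
    z = [0.0] * p
    for x_i, y_i in zip(X, Y):
        for j in range(p):
            z[j] += y_i * x_i[j]
    return [v / math.sqrt(n) for v in z]

# Toy data (hypothetical): n = 4 samples, p = 3 features.
X = [[1.0, 0.0, 2.0], [-1.0, 0.5, -2.0], [3.0, 0.0, 0.0], [-1.0, 0.5, 0.0]]
Y = [1, -1, 1, -1]
Z = z_vector(X, Y)
```

Multiplying each sample by its label aligns the two classes, so that every term $Y_i X_i$ has mean $\mu$ and the sum accumulates the signal at rate $\sqrt{n}$.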

In many applications, $\Omega$ is unknown and has to be estimated. In many other
applications such as complicated diseases or cancer, decades of biomedical
studies have accumulated huge databases which are sometimes referred to
as "data-for-data" [33]. Such databases can be used to accurately estimate
$\Omega$ independently of the data at hand, so $\Omega$ can be assumed to be known.
In this paper, we investigate both the case where $\Omega$ is known and the case
where $\Omega$ is unknown. In either case, we assume $\Omega$ has unit diagonals:

(1.4) $\Omega(i, i) = 1$, $1 \le i \le p$.

Such an assumption is only for simplicity, and we don't use such information
for inference.

1.1. Fisher's LDA and modern challenges. Fisher's linear discriminant
analysis (LDA) [20] is a well-known method for classification, which utilizes
a weighted average of the test features $L(X) = \sum_{j=1}^{p} w(j) X(j)$, and predicts
$Y = \pm 1$ if $L(X) \gtrless 0$. Here, $w = (w(1), \ldots, w(p))'$ is a preselected weight
vector. Fisher showed that the optimal weight vector satisfies

(1.5) $w \propto \Omega\mu$.

In the classical setting where $n \gg p$, $\mu$ and $\Omega$ can be conveniently estimated
and Fisher's LDA is approachable. Unfortunately, in the modern regime
where $p \gg n$, Fisher's LDA faces immediate challenges.

• It is challenging to estimate $\Omega$ simply because there are $O(p^2)$
unknown parameters but we have only $O(np)$ different measurements.

• Even when $\Omega$ is known and even in the simplest case where $\Omega = I_p$,
challenges remain, as the signals are rare and weak. See [15] for the
delicacy of the problem.

The paper is largely focused on addressing the second challenge, and
shows successful classification can be achieved by simultaneously exploiting
the sparsity of $\mu$ (a.k.a. signal sparsity) and the sparsity of $\Omega$ (a.k.a. graph
sparsity). For the first challenge, encouraging progress has been made
recently (e.g., [21, 9]), and the problem is more or less settled. Still, the
paper has a two-fold contribution along this line. First, we show that the
performance of the methods in [21, 9] can be substantially improved if we
add an additional re-fitting step; see details in Section 4. Second, we carefully
analyze how the errors in estimating $\Omega$ may affect the classification results.

1.2. Innovated Thresholding. We wish to adapt Fisher's LDA to the
current setting. Recall that the optimal choice of weight vector is $w \propto \Omega\mu$. If we
have a reasonably good estimate of $\Omega$ (see Section 1.8 for more discussion
on estimating $\Omega$), say, $\hat\Omega$, all we need is a good estimate of $\mu$.

When $\mu$ is sparse, one usually estimates it with some type of wavelet
thresholding [41]. Let $Z$ be the training z-vector as in (1.3). For some
threshold $t$ to be determined, there are three obvious approaches to thresholding:
• Brute-force Thresholding (BT). Applying thresholding to $Z$ directly
using the so-called clipping rule [15]: $\hat\mu_t(i) = \mathrm{sgn}(Z(i)) 1\{|Z(i)| \ge t\}$
(alternatively, one may use soft-thresholding or hard thresholding [15],
but the differences are secondary; similar below).

• Whitened Thresholding (WT). We first whiten the noise by the
transformation $Z \mapsto \hat\Omega^{1/2} Z \approx N(\sqrt{n}\,\Omega^{1/2}\mu, I_p)$, and then apply the
thresholding to the vector $\hat\Omega^{1/2} Z$ in a similar fashion.

• Innovated Thresholding (IT). We first take the transformation
$Z \mapsto \hat\Omega Z$ and then apply the thresholding by

(1.6) $\hat\mu_t(i) = \mathrm{sgn}(\tilde Z(i)) 1\{|\tilde Z(i)| \ge t\}$, where $\tilde Z = \hat\Omega Z$.

The transformation $Z \mapsto \hat\Omega Z$ is connected to the term Innovation in the
time series literature [23], hence the name Innovated Thresholding.
Which of the three approaches is the best?

It turns out IT is the best. To see the point, note that for any $p \times p$
non-singular matrix $M$, one could always estimate $\mu$ by applying the thresholding
to $MZ$ entry-wise (in BT, WT, and IT, $M = I_p$, $\Omega^{1/2}$, and $\Omega$, approximately).
The question is: what is the best $M$?

Towards this end, write $M = [m_1, m_2, \ldots, m_p]'$. For any $1 \le i \le p$, it
is seen that $(MZ)(i) \sim N(\sqrt{n}\, m_i'\mu, m_i'\Sigma m_i)$. Therefore, if we bet on $\mu(i) \ne 0$,
we should choose $m_i$ to optimize the Signal to Noise Ratio (SNR) of
$(MZ)(i)$. By the Cauchy-Schwarz inequality, the optimal $m_i$ satisfies
$m_i \propto \Omega\mu$. Writing $\Omega = [\omega_1, \omega_2, \ldots, \omega_p]$, it is seen that

(1.7) $\Omega\mu = \mu(i)\omega_i + \sum_{k \ne i} \mu(k)\omega_k = (I) + (II)$.

When we bet on $\mu(i) \ne 0$, $(I) \propto \omega_i$, which is accessible to us. However, $(II)$
is a very noisy vector and is inaccessible to us; estimating it is as
hard as estimating $\mu$ itself. The point can be further elaborated as follows:
since we don't know the locations of other nonzero coordinates of $\mu$, it makes
sense to model $\{\sqrt{n}\mu(j) : 1 \le j \le p, j \ne i\}$ as iid samples from

(1.8) $(1 - \epsilon_p)\nu_0 + \epsilon_p H_p$, $\epsilon_p > 0$: small,

where $\nu_0$ is the point mass at 0 and $H_p$ is some distribution with no mass
at 0. Under general "rare and weak" conditions for $\mu$ and a sparsity condition
for $\Omega$, coordinates of $E[(II)]$ are uniformly small. This suggests that $(II)$ is
generally non-informative in designing the best $m_i$, and all we could utilize
is $(I)$.
In summary, if we bet on $\mu(i) \ne 0$, the optimal choice is $m_i \propto \omega_i$. As this
holds for all $i$ and we don't know where the signals are, the optimal choice
for $M$ is $M = \Omega$. This says that IT is not only the best among the three
choices above, but is also the best choice in a more general context.
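For completeness, the Cauchy-Schwarz step used above can be written out as follows. This is a sketch in the notation of this section, where $m$ denotes a generic row of $M$ and $\mathrm{SNR}(m)$ is shorthand introduced here only for illustration:

```latex
\[
\mathrm{SNR}(m) = \frac{\sqrt{n}\,|m'\mu|}{\sqrt{m'\Sigma m}},
\qquad
(m'\mu)^2 = \bigl((\Sigma^{1/2} m)'(\Sigma^{-1/2}\mu)\bigr)^2
\le (m'\Sigma m)\,(\mu'\Omega\mu),
\]
\[
\text{so that}\quad \mathrm{SNR}(m) \le \sqrt{n\,\mu'\Omega\mu},
\quad\text{with equality if and only if}\quad
\Sigma^{1/2} m \propto \Sigma^{-1/2}\mu,
\ \text{i.e.,}\ m \propto \Omega\mu .
\]
```

The bound does not depend on $i$, which is why the same direction $\Omega\mu$ appears for every row before the decomposition into a known and an unknown part.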

In the literature of variable selection, IT is also called marginal regression
[22]. The connection is not surprising, as approximately,
$\hat\Omega^{1/2} Z \approx \Omega^{1/2} Z \sim N(\sqrt{n}\,\Omega^{1/2}\mu, I_p)$, which is a regression model. Both methods apply
thresholding to $\hat\Omega Z$ entry-wise, but marginal regression uses the hard thresholding
rule, and IT uses the clipping thresholding rule [15].

That said, the challenge remains of how to set the threshold $t$
of IT (see (1.6)). If we set $t$ too small or too large, the resultant estimator
$\hat\mu_t$ has too many or too few nonzeros. Our proposal is to set the threshold
in a data-driven fashion by using the recent innovation of Higher Criticism
Thresholding (HCT).

1.3. Threshold choice by Higher Criticism. Higher Criticism (HC) is a
notion mentioned in passing by Tukey [40]. In recent years, HC was found to
be useful in sparse signal detection [14], large-scale multiple testing [2, 7, 42],
goodness-of-fit [29], and was applied to non-Gaussian detection in the Cosmic
Microwave Background [11] and genomics [25, 35]. HC as a method for
threshold choice in feature selection was first introduced in [15] (see also
[24]), but the study has been focused on the case where $\Omega$ is the identity
matrix. The case we consider in the current paper is much more complicated,
where how to use HC for threshold choice is a non-trivial problem.

Our proposal is as follows. Let $\hat\Omega$ be a reasonably good estimate of $\Omega$ and
let $Z$ be the training z-vector as in (1.3). As in (1.6), denote for short

(1.9) $\tilde Z = \tilde Z(Z, \hat\Omega, p, n) = \hat\Omega Z$.

The proposed approach contains three simple steps.

• For each $1 \le j \le p$, obtain a p-value by $\pi_j = P(|N(0, 1)| \ge |\tilde Z(j)|)$.

• Sort all the p-values in ascending order: $\pi_{(1)} < \pi_{(2)} < \ldots < \pi_{(p)}$.

• Define the HC functional $HC_{p,j} = \sqrt{p}\,[j/p - \pi_{(j)}] / \sqrt{(1 - j/p)\,j/p}$,
$1 \le j \le p$. Let $\hat j$ be the index at which $HC_{p,j}$ takes the maximum. The
Higher Criticism Threshold (HCT), denoted by $|\tilde Z|_{(\hat j)}$, is defined as
the $\hat j$-th largest coordinate of $(|\tilde Z(1)|, \ldots, |\tilde Z(p)|)'$.
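The three steps above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the two-sided normal p-value is computed with `math.erfc`, and the index $j = p$ is skipped because the denominator of the HC functional vanishes there (an assumption of this sketch).

```python
import math

def two_sided_pvalue(z):
    """P(|N(0,1)| >= |z|), computed via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def hc_threshold(z_tilde):
    """Return (j_hat, threshold) for the innovated z-vector z_tilde.

    j_hat maximizes the HC functional; the threshold is the j_hat-th
    largest value among |z_tilde(1)|, ..., |z_tilde(p)|.
    """
    p = len(z_tilde)
    abs_sorted = sorted((abs(z) for z in z_tilde), reverse=True)  # largest first
    best_j, best_hc = 1, -math.inf
    # The j-th smallest p-value belongs to the j-th largest |z|.
    for j in range(1, p):  # j = p skipped: (1 - j/p) * j/p = 0 there
        pi_j = two_sided_pvalue(abs_sorted[j - 1])
        hc = math.sqrt(p) * (j / p - pi_j) / math.sqrt((1 - j / p) * (j / p))
        if hc > best_hc:
            best_j, best_hc = j, hc
    return best_j, abs_sorted[best_j - 1]
```

For example, with two clearly large coordinates among ten, `hc_threshold([4.0, 3.5, 0.1, -0.2, 0.3, 0.05, -0.15, 0.25, 0.4, -0.1])` selects $\hat j = 2$ and returns the threshold 3.5, so exactly the two large coordinates survive.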
Moreover, for stability, we need the following refinement. Define

(1.10) $s_p^* = \sqrt{2\log(p)}$, $\tilde s_p^* = \tilde s_{p,n}^* = \sqrt{2\max\{0, \log(p/n^2)\}}$.

It is well-understood (e.g., [14, 23]) that we should not allow the threshold
to be larger than $s_p^*$. At the same time, we should not allow the threshold
to be too small, especially when $n$ is small. The Higher Criticism Threshold
(HCT) we use in this paper is



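(1.11) $t_p^{HC} = |\tilde Z|_{(\hat j)}$ if $\tilde s_p^* \le |\tilde Z|_{(\hat j)} \le s_p^*$; $t_p^{HC} = \tilde s_p^*$ if $|\tilde Z|_{(\hat j)} < \tilde s_p^*$; $t_p^{HC} = s_p^*$ if $|\tilde Z|_{(\hat j)} > s_p^*$.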



See Sections 1.5 and 3 for more detailed discussions. 
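The refinement can be sketched as follows, under the assumption that it amounts to clipping the raw HC threshold to the interval between the lower and upper bounds in (1.10), which is what the surrounding text describes. The function names are hypothetical.

```python
import math

def hct_bounds(p, n):
    """Lower and upper bounds for the threshold, as in (1.10)."""
    s_upper = math.sqrt(2.0 * math.log(p))
    s_lower = math.sqrt(2.0 * max(0.0, math.log(p / n**2)))
    return s_lower, s_upper

def refine_threshold(raw_threshold, p, n):
    """Clip the raw HC threshold into [s_lower, s_upper] (assumed form of the refinement)."""
    s_lower, s_upper = hct_bounds(p, n)
    return min(max(raw_threshold, s_lower), s_upper)
```

Note that the lower bound is 0 whenever $n^2 \ge p$, so the lower clip only matters when the sample size is small relative to $\sqrt{p}$.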



1.4. HCT trained classifier. We are now ready for classification. Let $\hat\Omega$
be as above, and let $\hat\mu^{HC} = \hat\mu^{HC}(Z, \hat\Omega, p, n)$ be defined as

(1.12) $\hat\mu^{HC}(j) = \mathrm{sgn}(\tilde Z(j)) \cdot 1\{|\tilde Z(j)| \ge t_p^{HC}\}$, $1 \le j \le p$.

Compared to $\hat\mu_t$ in (1.6), the only difference is that we have replaced $t$ by
$t_p^{HC}$. Introduce the HCT classification statistic

(1.13) $L_{HC}(X, \hat\Omega) = L_{HC}(X, \hat\Omega; Z, p, n) = (\hat\mu^{HC})' \hat\Omega X$.

The HCT trained classifier (or HCT classifier for short) is then the decision
rule that decides $Y = \pm 1$ according to $L_{HC}(X, \hat\Omega) \gtrless 0$.
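Putting the pieces together, the decision rule can be sketched as below. This is a minimal illustration with hypothetical function names: the threshold `t` is passed in as an argument (it would come from the HC procedure), the estimated precision matrix is a plain list of rows, and a score of exactly 0 is assigned to class $-1$, a tie-breaking choice made only for this sketch.

```python
def mat_vec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def clip_estimate(z_tilde, t):
    """Clipping rule: sign of z_tilde(j) if |z_tilde(j)| >= t, else 0."""
    return [(1 if z > 0 else -1) if abs(z) >= t else 0 for z in z_tilde]

def hct_classify(x, z, omega_hat, t):
    """Predict the label of a fresh vector x from the score mu_hat' * omega_hat * x."""
    z_tilde = mat_vec(omega_hat, z)      # innovated z-vector
    mu_hat = clip_estimate(z_tilde, t)   # selected features with signs
    score = sum(m * v for m, v in zip(mu_hat, mat_vec(omega_hat, x)))
    return 1 if score > 0 else -1

# Hypothetical toy inputs.
omega_hat = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]
z = [3.0, -1.0, 0.2]
label = hct_classify([1.0, 0.5, -4.0], z, omega_hat, t=2.0)
```

Only the coordinates retained by the threshold enter the score, so the large but unselected third coordinate of the test vector has no influence on the decision.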

The innovation of the procedure is two-fold: using IT for feature selection
and using HCT for threshold choice in the more complicated case where $\Omega$
is unknown and is non-identity. The work is connected to other works on
HC [23, 15], but the procedure and the delicate theory it entails are new.

A natural question is whether IT has any advantages over existing
variable selection methods (e.g., the Lasso [38], SCAD [19], Dantzig selector
[10]). The answer is yes, for the following reasons. First, compared to these
methods, IT is computationally much faster and much more approachable 
for delicate analysis. Second, our goal is classification, not variable selection. 
For classification, especially when features are rare and weak, the choice of 
different variable selection methods is secondary, while the choice of the 
tuning parameter is crucial. The threshold of IT can be conveniently set by 
HCT, but how to set the tuning parameter of the Lasso, SCAD, or Dantzig 
Selector remains an open problem, at least in theory. 

How does the HCT classifier behave? In Sections 1.5-1.6, we set up a
theoretic framework and derive a lower bound for classification errors. In
Sections 1.7-1.8, we investigate the HCT classifier in the case where $\Omega$ is
known and in the case where $\Omega$ is unknown separately, and show that the
HCT classifier yields the optimal phase diagram in classification.
1.5. Asymptotic Rare and Weak model. Motivated by the application
examples mentioned above, we use a Rare and Weak signal model as follows.
We model the coordinates of the scaled contrast mean vector $\sqrt{n}\mu$ as

(1.14) $\sqrt{n}\mu(j) \stackrel{iid}{\sim} (1 - \epsilon_p)\nu_0 + \epsilon_p H_p$, $1 \le j \le p$,

where as in (1.8), $\nu_0$ is the point mass at 0, $H_p$ is some distribution with no
mass at 0, and $\epsilon_p \in (0, 1)$ is small (note that $(\epsilon_p, H_p)$ depend on $p$ but not on
$j$). We use $p$ as the driving asymptotic parameter, and link $(n, \epsilon_p, H_p)$ to $p$
through some fixed parameters. In detail, fixing parameters $(\beta, \theta) \in (0, 1)^2$,
we model

(1.15) $\epsilon_p = p^{-\beta}$, $n = n_p = p^{\theta}$.

As $p$ tends to $\infty$, the sample size $n_p$ grows to $\infty$ but at a slower rate than
that of $p$; the signals get increasingly sparser but the number of signals tends
to $\infty$. The interesting range of parameters $(\beta, \theta, H_p)$ partitions into three
regimes, according to the sparsity level.

• Relatively Dense (RD). In this regime, $0 < \beta < (1 - \theta)/2$. The
signals are relatively dense and successful classification is possible even
when signals are very faint (e.g., $H_p$ concentrates its mass around a
term $\tau_p \ll \sqrt{2\log(p)}$). In such cases, (a) successful feature selection is
impossible as signals are too weak, and (b) feature selection is
unnecessary because the signals are relatively dense.

• Rare and Weak (RW). In this regime, $(1 - \theta)/2 < \beta < (1 - \theta)$,
and the signals are moderately sparse. For successful classification,
we need moderately strong signals (i.e., nonzero coordinates of $\sqrt{n}\mu \asymp \sqrt{\log(p)}$).
In this case, feature selection is subtle but could be
substantially helpful. In contrast, classification is impossible if signals are
much weaker than $\sqrt{\log(p)}$, and feature selection is trivial if signals
are much stronger than $\sqrt{\log(p)}$.

• Rare and Strong (RS). In this regime, $\beta > (1 - \theta)$, and the signals are
very sparse. For successful classification, we need very strong signals
(signal strength $\gg \sqrt{\log(p)}$). In this case, feature selection is
comparatively easier to carry out (but substantially helpful) since the signals
are strong enough to stand out for themselves.
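The partition into three regimes depends only on the sparsity exponent and the sample-size exponent. A minimal sketch (the function name is hypothetical, and since the text uses strict inequalities, assigning each boundary value to the sparser regime is an arbitrary choice of this sketch):

```python
def sparsity_regime(beta, theta):
    """Classify (beta, theta) in (0,1)^2 into the RD, RW, or RS regime."""
    if not (0 < beta < 1 and 0 < theta < 1):
        raise ValueError("beta and theta must lie in (0, 1)")
    if beta < (1 - theta) / 2:
        return "RD"   # Relatively Dense
    if beta < 1 - theta:
        return "RW"   # Rare and Weak
    return "RS"       # Rare and Strong
```

For instance, with $\theta = 0.4$ the RW regime is $0.3 < \beta < 0.6$, so `sparsity_regime(0.45, 0.4)` returns `"RW"`.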

While the statements hold broadly, the most transparent way to
understand them is probably to consider the case where $H_p$ is a point mass at $\tau_p$
(say): in the above three regimes, the minimum $\tau_p$ required for successful
classification (up to some multi-$\log(p)$ factors in the first and last regimes)
is $1/(\epsilon_p\sqrt{p/n_p})$, $\sqrt{\log(p)}$, and $\sqrt{n_p/(p\epsilon_p)}$, respectively; the proof is
elementary and is omitted.

In summary, feature selection is impossible in the RD regime and is
relatively easy in the RS regime. For these reasons, we are primarily interested
in the RW regime where we assume

(1.16) $(1 - \theta)/2 < \beta < (1 - \theta)$.

The RD/RS regimes are further discussed in Section 1.10, where we address
the connection between our work and [18, 8]. For $\beta$ in this range, the most
interesting range for the signal strength is when $H_p$ concentrates its mass at
the scale of $\sqrt{\log(p)}$. In light of this, we fix $r > 0$ and calibrate the signal
strength parameter $\tau_p$ by

(1.17) $\tau_p = \sqrt{2r\log(p)}$.

Except in Section 1.6 where we address the lower bound arguments, we
assume $H_p$ is a point mass (compare (1.14)):

(1.18) $H_p = \nu_{\tau_p}$, where $\tau_p = \sqrt{2r\log(p)}$ is as in (1.17) and $0 < r < 1$.

This models a setting where the signal strengths are equal. We focus on
the case $0 < r < 1$, as the case $r > 1$ corresponds to the RS
regime where the classification is comparatively easier. The case where the
signal strengths are unequal is discussed in Section 1.10.
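To get a feel for the calibration, the following sketch converts the exponents into concrete quantities and draws the scaled means from the two-point mixture with a point mass for the nonzero part. The function names, dictionary keys, and the seed argument are hypothetical, chosen only for illustration.

```python
import math
import random

def arw_calibration(p, beta, theta, r):
    """Translate (beta, theta, r) into eps_p, n_p, tau_p and the expected signal count."""
    eps_p = p ** (-beta)                       # fraction of useful features
    n_p = p ** theta                           # sample size
    tau_p = math.sqrt(2.0 * r * math.log(p))   # common signal strength
    return {"eps_p": eps_p, "n_p": n_p, "tau_p": tau_p,
            "expected_signals": p * eps_p}

def sample_scaled_means(p, eps_p, tau_p, seed=0):
    """Draw sqrt(n)*mu(j) iid from (1 - eps_p) * (mass at 0) + eps_p * (mass at tau_p)."""
    rng = random.Random(seed)
    return [tau_p if rng.random() < eps_p else 0.0 for _ in range(p)]
```

With $p = 10^4$, $\beta = \theta = 0.5$ and $r = 0.3$, this gives about 100 signals among 10,000 features, a sample size of 100, and a signal strength of roughly 2.35, which is well below the level $\sqrt{2\log(p)} \approx 4.29$ of the largest noise coordinates.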

Next, we model $\Omega$. Motivated by the previous example on the Genetic
Regulatory Network, we assume each row of $\Omega$ has relatively few nonzeros. Such a
matrix naturally induces a sparse graph $\mathcal{G} = (V, E)$, where $V = \{1, 2, \ldots, p\}$
and there is an edge between nodes $i$ and $j$ if and only if $\Omega(i, j) \ne 0$.

Definition 1.1. Fix $1 \le K_p \le p$. We call $\Omega$ $K_p$-sparse if and only if
each row of $\Omega$ has at most $K_p$ nonzeros, and we call $\mathcal{G}$ $K_p$-sparse if and only
if the maximum degree is $\le K_p$.

The class of $K_p$-sparse graphs is much broader than the class of banded
graphs (we call $\mathcal{G}$ a banded graph with bandwidth $K$ if nodes $i$ and $j$ are
not connected whenever $|i - j| > K$). In fact, even when $\mathcal{G}$ is $K_p$-sparse with
$K_p = 2$, we cannot always shuffle the nodes of $\mathcal{G}$ and make it a banded
graph with a small bandwidth.
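The quantities in Definition 1.1 and the bandwidth of a matrix in its given node ordering are easy to compute. The sketch below uses hypothetical function names and a toy matrix whose induced graph is a cycle on five nodes; it only reports the bandwidth for the ordering as given and makes no claim about the best possible relabeling.

```python
def row_sparsity(omega):
    """Largest number of nonzero entries in any row (diagonal included)."""
    return max(sum(1 for v in row if v != 0) for row in omega)

def max_degree(omega):
    """Maximum degree of the induced graph: edge i-j iff omega[i][j] != 0, i != j."""
    return max(sum(1 for j, v in enumerate(row) if j != i and v != 0)
               for i, row in enumerate(omega))

def bandwidth(omega):
    """Smallest K such that omega[i][j] == 0 whenever |i - j| > K, in the given ordering."""
    return max((abs(i - j) for i, row in enumerate(omega)
                for j, v in enumerate(row) if v != 0), default=0)

# Toy matrix (hypothetical): unit diagonal, each node linked to its two cyclic neighbors.
p = 5
cycle = [[1.0 if i == j else (0.3 if (i - j) % p in (1, p - 1) else 0.0)
          for j in range(p)] for i in range(p)]
```

For this matrix each row has three nonzeros and every node has degree two, while the edge joining the first and last nodes makes the bandwidth in this ordering equal to $p - 1 = 4$.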

Let $\mathcal{M}_p$ be the class of all $p \times p$ positive definite correlation matrices.
Fixing $a \in (0, 1)$, $b > 0$, and a sequence of integers $K_p$, introduce

(1.19) $\mathcal{M}_p^*(a, K_p) = \{\Omega \in \mathcal{M}_p : \Omega \text{ is } K_p\text{-sparse}, |\Omega(i, j)| \le a, i \ne j\}$,
and

(1.20) $\mathcal{M}_p^*(a, b, K_p) = \{\Omega \in \mathcal{M}_p^*(a, K_p), \|\Omega^{-1}\| \le b\}$,

where $\| \cdot \|$ is the spectral norm. In comparison, $\mathcal{M}_p^*(a, b, K_p)$ is slightly
smaller than $\mathcal{M}_p^*(a, K_p)$. The following short-hand notation is frequently
used in this paper.

Definition 1.2. We use $L_p$ to denote a strictly positive generic
multi-$\log(p)$ term that may vary from occurrence to occurrence but always satisfies
that for any fixed $c > 0$, $\lim_{p \to \infty}\{L_p p^{-c}\} = 0$ and $\lim_{p \to \infty}\{L_p p^{c}\} = \infty$.

In this paper, we are primarily interested in the case where $K_p$ is at most
multi-logarithmically large unless stated otherwise:

(1.21) $\lim_{p \to \infty} K_p = \infty$, $K_p \le L_p$;

the first requirement is only for convenience. In our classification setting,
$X_i \sim N(Y_i\mu, \Sigma)$, $X \sim N(Y\mu, \Sigma)$, and $Y = \pm 1$ with equal probabilities. The
following notation is frequently used in the paper.

Definition 1.3. We say the classification problem (1.1)-(1.2) satisfies
the Asymptotic Rare Weak model $ARW(\beta, r, \theta, \Omega)$ if (1.14)-(1.15), (1.18),
and (1.21) hold.

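1.6. Lower bound. Introduce the standard phase boundary function

(1.22) $\rho(\beta) = 0$ for $0 < \beta \le 1/2$; $\rho(\beta) = \beta - 1/2$ for $1/2 < \beta \le 3/4$; $\rho(\beta) = (1 - \sqrt{1 - \beta})^2$ for $3/4 < \beta < 1$,

and let

$\rho_\theta^*(\beta) = (1 - \theta)\,\rho(\beta/(1 - \theta))$, $(1 - \theta)/2 < \beta < (1 - \theta)$.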
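The phase boundary functions can be evaluated numerically as in the sketch below. The function names are hypothetical, and the middle branch is taken to be $\beta - 1/2$, the form of the standard phase boundary function that makes it continuous at $\beta = 1/2$ and $\beta = 3/4$; this is an assumption of the sketch.

```python
import math

def rho(beta):
    """Standard phase boundary function on (0, 1)."""
    if not 0 < beta < 1:
        raise ValueError("beta must lie in (0, 1)")
    if beta <= 0.5:
        return 0.0
    if beta <= 0.75:
        return beta - 0.5                       # assumed middle branch
    return (1.0 - math.sqrt(1.0 - beta)) ** 2

def rho_theta(beta, theta):
    """Rescaled boundary (1 - theta) * rho(beta / (1 - theta)) on the RW range."""
    if not (1 - theta) / 2 < beta < 1 - theta:
        raise ValueError("beta must satisfy (1 - theta)/2 < beta < 1 - theta")
    return (1 - theta) * rho(beta / (1 - theta))
```

For example, with $\theta = 0.4$ and $\beta = 0.45$ the rescaled argument is $0.75$, so the boundary value is $0.6 \times 0.25 = 0.15$.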

The function $\rho$ has appeared before in determining phase boundaries in a
seemingly unrelated problem on multiple hypothesis testing [26, 27, 14]. The
following theorem is proved in Section 5.

Theorem 1.1. Fix $(\beta, r, \theta) \in (0, 1)^3$ such that $(1 - \theta)/2 < \beta < (1 - \theta)$
and $0 < r < \rho_\theta^*(\beta)$. Suppose (1.14)-(1.15), (1.17), and (1.21) hold and that
for sufficiently large $p$, $\Omega \in \mathcal{M}_p^*(a, K_p)$ and the support of $H_p$ is contained
in $[-\tau_p, \tau_p]$. Then as $p \to \infty$, for any sequence of trained classifiers, the
misclassification error $\gtrsim 1/2$.
Note that in Theorem 1.1, we don't require the signals to have the same
strengths. Also, recall that in our classification setting (1.1)-(1.2), two classes
are assumed to be equally likely; extension to the case where two classes are
unequally likely is straightforward. Theorem 1.1 was discovered before in
[15, 28], but the study has been focused on the case where $\Omega = I_p$ and $H_p$
is the point mass at $\tau_p$. The proof in the current case is much more difficult
and needs a few tricks, where separability of sparse graphs plays a key role.

Lemma 1.1. Fix a sufficiently large $p$ and $1 \le K_p \le p$ and suppose
$\mathcal{G} = (V, E)$ is a $K_p$-sparse graph. There is a constant $C > 0$ such that the
graph decomposes into at most $CK_p\log(p)$ different disjoint subsets, where
in each subset, there is no edge between any pair of nodes.

Lemma 1.1 is proved in Section 5. The proof uses the pigeon-hole principle
and is elementary, but the result has far-reaching implications. Lemma 1.1
is the cornerstone for proving the lower bound and for analyzing the HCT
classifier (where we need a tight convergence rate of empirical processes for
data with non-conventional correlation structures).
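A decomposition of the kind described in Lemma 1.1 can be produced constructively. The sketch below uses greedy graph coloring, which is not the pigeon-hole argument of the paper but a standard alternative: it yields at most (maximum degree + 1) subsets with no edge inside any subset. The function name and the adjacency-dictionary input format are hypothetical.

```python
def independent_partition(adj):
    """Split the nodes into subsets with no edge inside any subset.

    adj maps each node to the list of its neighbors. Greedy coloring gives
    each node the smallest color not used by its already-colored neighbors,
    so at most (max degree + 1) colors are needed.
    """
    color = {}
    for v in sorted(adj):
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    classes = {}
    for v, c in color.items():
        classes.setdefault(c, []).append(v)
    return [classes[c] for c in sorted(classes)]

# Toy graph (hypothetical): a cycle on five nodes, maximum degree 2.
cycle = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
parts = independent_partition(cycle)
```

On the five-node cycle this returns three subsets, matching the bound of maximum degree plus one. Within each subset the corresponding coordinates have no direct links in the graph, which is the structural property the lemma provides.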

1.7. HCT achieves optimal phase diagram in classification ($\Omega$ is known).
One noteworthy aspect of the HCT classifier is that it achieves the optimal phase
diagram. In this section, we show this for the case where $\Omega$ is known. In this
case, the HCT classifier $L_{HC}(X, \hat\Omega)$ reduces to $L_{HC}(X, \Omega)$ (the term formed
by replacing $\hat\Omega$ by $\Omega$ everywhere in the definition of the former). The following
theorem is proved in Section 5.

Theorem 1.2. Fix $(\beta, r, \theta, a) \in (0, 1)^4$ such that $(1 - \theta)/2 < \beta < (1 - \theta)$
and $r > \rho_\theta^*(\beta)$. Consider a sequence of classification problems $ARW(\beta, r, \theta, \Omega)$
with $\Omega \in \mathcal{M}_p^*(a, b, K_p)$ for sufficiently large $p$. Then as $p$ tends to $\infty$,
$P(Y \cdot L_{HC}(X, \Omega) < 0) \to 0$. When $r < \beta$, the condition on $\Omega$ can be relaxed
to that of $\Omega \in \mathcal{M}_p^*(a, K_p)$.

Call the two-dimensional space $\{(\beta, r) : 0 < \beta < 1, 0 < r < 1\}$ the
phase space. Theorems 1.1-1.2 say that the phase space partitions into two
separate regions, the Region of Impossibility and the Region of Possibility, where
the classification problem is distinctly different.

• Region of Impossibility. $\{(\beta, r) : (1 - \theta)/2 < \beta < (1 - \theta), 0 < r < \rho_\theta^*(\beta)\}$.
Fix $(\beta, r)$ in the interior of this region and consider a
sequence of classification problems with $p^{1-\beta}$ signals where each signal
is $\le \sqrt{2r\log(p)}$ in strength. Then for any sequence of 'sparse' $\Omega$,
successful classification is impossible. This is the most difficult case where
not much can be done for classification aside from random guessing.
• Region of Possibility. $\{(\beta, r) : (1 - \theta)/2 < \beta < (1 - \theta), \rho_\theta^*(\beta) < r < 1\}$.
Fix $(\beta, r)$ in the interior of this region and suppose signals
have equal strength of $\sqrt{2r\log(p)}$. The HCT classifier $L_{HC}(X, \Omega)$ yields
successful classification (the results hold much more broadly where
the equal signal strength assumption can be largely relaxed).

We call the curve $r = \rho_\theta^*(\beta)$ the separating boundary. Somewhat surprisingly,
the separating boundary does not depend on the off-diagonals of $\Omega$. The
partition of the phase diagram was discovered by [15, 31], and independently
by [28], but there the focus was on the case where $\Omega = I_p$. See also [24].
The study in the current case is much more difficult. A similar phase diagram
was also found in sparse signal detection [14], variable selection [30], and
spectral clustering [32].

Why does HCT work? The key insight is that there is an intimate relationship
between the HC functional and Fisher's separation; the latter plays a key
role in determining the optimal classification behavior, but is, unfortunately,
an oracle quantity which depends on unknown parameters. In Sections 2-3,
we outline a series of theoretic results, explaining why the HCT classifier is
the right approach and how it achieves the optimality.

1.8. Optimality of HCT classification ($\Omega$ is unknown). When $\Omega$ is
unknown, we first estimate it with the training data.

Definition 1.4. For any sequence of $\Omega_{p,p} \in \mathcal{M}_p^*(a, K_p)$, we say an
estimator $\hat\Omega_{p,p}$ is acceptable if it is symmetric and independent of the test vector
$X$, and there is a constant $C > 0$ such that for sufficiently large $p$, $\hat\Omega_{p,p}$
is $K_p'$-sparse where $K_p' \le L_p$, and $|\hat\Omega_{p,p}(i, j) - \Omega_{p,p}(i, j)| \le C K_p \sqrt{\log(p)}/\sqrt{n_p}$
for all $1 \le i, j \le p$.

Usually, the $(L_p/\sqrt{n_p})$-rate cannot be improved, even when $\Omega$ is diagonal.
For $K_p$-sparse $\Omega$ satisfying (1.21), acceptable estimators can be constructed
based on the recent CLIME approach by [9]. If additionally $\Omega$ satisfies the
mutual incoherence condition [34, Assumption 1], then the glasso [21] is
also acceptable, provided the tuning parameters are properly set. If $\Omega$ is
banded, then the Bickel and Levina Thresholding (BLT) method [4] is also
acceptable, up to some modifications.

That said, the numerical performance of all these estimators can
be improved with an additional step of re-fitting. See Section 4 for details.

Naturally, the estimation error of $\hat\Omega$ has some negative effects on the HCT
classifier. Fortunately, for a large fraction of parameters $(\beta, r)$ in the Region of
Possibility, such effects are negligible and HCT continues to yield successful
classification. In detail, suppose

• Condition (a). $r > \max\{(1 - 2\theta)/4, \rho_\theta^*(\beta)\}$.

• Condition (b). When $0 < \theta < 1/3$ and $(1 - \theta)/2 < \beta < (1 - 2\theta)$,
$|\sqrt{r} - \sqrt{1 - 2\theta}| \ge \sqrt{1 - 2\theta - \beta}$.

The following theorem is proved in Section 5. 

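Theorem 1.3. Fix $(\beta, r, \theta, a) \in (0, 1)^4$ such that $(1 - \theta)/2 < \beta < (1 - \theta)$,
and Conditions (a)-(b) hold. Consider a sequence of classification problems
$ARW(\beta, r, \theta, \Omega)$ such that $\Omega \in \mathcal{M}_p^*(a, K_p)$ when $r < \beta$ and $\Omega \in \mathcal{M}_p^*(a, b, K_p)$
when $r \ge \beta$. For the HCT classifier $L_{HC}(X, \hat\Omega)$, if $\hat\Omega$ is acceptable, then as
$p$ tends to $\infty$, $P(Y \cdot L_{HC}(X, \hat\Omega) < 0) \to 0$.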

We remark that, first, when $0 < \theta < 1/4$ and $(1 - \theta)/2 < \beta < 3(1 - 2\theta)/4$,
Condition (a) can be relaxed to that of $r > \max\{\beta/3, \rho_\theta^*(\beta)\}$. Second, when
$\theta > 1/2$, Conditions (a)-(b) automatically hold when $r > \rho_\theta^*(\beta)$. As a result,
we have the following corollary, the proof of which is omitted.

Corollary 1.1. When $\theta > 1/2$, Theorem 1.3 holds with Conditions
(a)-(b) replaced by that of $r > \rho_\theta^*(\beta)$.

This says that as long as $n_p \gg \sqrt{p}$, the estimation errors of any acceptable
estimator have negligible effects on the classification decision.

1.9. Comparison with BT and WT. Many existing methods are, in disguise, what
we call 'Brute-force Thresholding' or 'BT', including but not limited to
[3, 17, 39]. Since $\Omega$ is hard to estimate, Bickel and Levina [3] and Tibshirani
et al. [39] neglect the off-diagonals in $\Sigma$ for classification. In a seemingly
different spirit, Efron [17] proposes a procedure where he first selects features
by neglecting the off-diagonals in $\Sigma$ and then estimates the correlation
structures among selected features. However, under the Rare and Weak model,
selected features tend to be uncorrelated. Therefore, at least for many cases,
the approach fails to exploit the 'local' graphic structure of the data and is
'BT' in disguise. It is also noteworthy that [39] proposes to set the threshold
of BT by cross validation, which is unstable, especially when $n_p$ is small.

When we replace IT by either BT or WT in the HCT classifier, the phase
diagram associated with the resultant procedure is no longer optimal. While
the claim holds very broadly, it can be conveniently illustrated with a simple
case, where $p$ is even, and $\Omega$ is known and equals the block diagonal matrix
calibrated by a parameter $h \in (-1, 1)$, where for all $1 \le i, j \le p$,

(1.23) $\Omega(i, j) = 1\{i = j\} + h \cdot 1\{j - i = 1, i \text{ is odd}\} + h \cdot 1\{i - j = 1, i \text{ is even}\}$.

In this simple case, we have the following theorem, the proof of which is
elementary and is omitted (a similar claim holds for WT if we replace $(1 - h^2)$
by $(1 + \sqrt{1 - h^2})/2$ below).

Theorem 1.4. Fix $(\beta, \theta, r) \in (0, 1)^3$ such that $(1 - \theta)/2 < \beta < (1 - \theta)$
and $\rho_\theta^*(\beta) < r < \rho_\theta^*(\beta)/(1 - h^2)$. Suppose (1.18) and (1.23) hold. As $p \to \infty$,
the classification error of the HCT classifier tends to 0, but that of the HCT classifier
with IT replaced by BT tends to 1/2, even when the threshold is ideally set.
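The factors $(1 - h^2)$ and $(1 + \sqrt{1 - h^2})/2$ can be checked numerically within a single $2 \times 2$ block. The sketch below is an illustration under one simplifying assumption, namely that the signal is isolated in its block (the other coordinate of the block has zero mean), which is the typical situation when signals are rare. The function name is hypothetical.

```python
import math

def block_snr(h, tau):
    """SNR of the first coordinate of M*Z for M = I (BT), Omega^{1/2} (WT), Omega (IT).

    One 2x2 block Omega = [[1, h], [h, 1]], noise covariance Sigma = Omega^{-1},
    and scaled mean (tau, 0).
    """
    d = 1.0 - h * h
    sigma = [[1.0 / d, -h / d], [-h / d, 1.0 / d]]
    a = (math.sqrt(1 + h) + math.sqrt(1 - h)) / 2  # diagonal entry of Omega^{1/2}
    b = (math.sqrt(1 + h) - math.sqrt(1 - h)) / 2  # off-diagonal entry of Omega^{1/2}
    rows = {"BT": (1.0, 0.0), "WT": (a, b), "IT": (1.0, h)}
    mean = (tau, 0.0)
    out = {}
    for name, m in rows.items():
        signal = m[0] * mean[0] + m[1] * mean[1]
        var = sum(m[i] * sigma[i][j] * m[j] for i in range(2) for j in range(2))
        out[name] = signal / math.sqrt(var)
    return out

snr = block_snr(h=0.6, tau=2.0)
```

With $h = 0.6$ and $\tau = 2$, the squared SNR ratios relative to IT come out as $1 - h^2 = 0.64$ for BT and $(1 + \sqrt{1 - h^2})/2 = 0.9$ for WT, so the effective signal strength parameter $r$ is reduced by these factors when IT is replaced by BT or WT.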

1.10. Comparison with works focused on the RS Regime. Many recent works
effectively focus on the "Rare and Strong" regime in
our terminology. One example is [36], where the minimum signal
strength (smallest coordinate in magnitude of $\sqrt{n_p}\mu$) is assumed to be of the order of $\sqrt{n_p}$.
Other examples include the ROAD approach by Fan et al. [18] and the LPD
approach by Cai et al. [8], where the main results (i.e., [18, Theorem 3], [8,
Theorem 1]) assume a sparsity constraint that can be roughly translated to
$\beta > (1 - \theta/2)$ in our notation. This appears to concern the RS Regime we
mentioned earlier.

Compared to these works, our work focuses on the most challenging 
regime where the signals are Rare and Weak, and we need much more so- 
phisticated methods for feature selection and for threshold choices. 

1.11. Comparison with other popular classifiers. HCT classifier also has 
advantages over well-known classifiers such as the Support Vector Machine 
(SVM) [6], Random Forest [5], and Boosting [13]. These methods need tuning 
parameters and are internally very complicated, but they do not outperform 
HCT classifier even when we replace the IT by BT; see details in [15], where 
we compared all these methods with three well-known gene microarray data 
sets in the context of cancer classification. 

HCT is also closely related to PAM [39], but is different in important ways. 
First, HCT exploits the correlation structure while PAM does not. Second, 
while both methods perform feature selection, PAM sets the threshold by 
cross validations (CVT), while HCT sets the threshold by Higher Criticism. 
When n is small, CVT is usually unstable. In [15], we have shown that HCT 
outperforms CVT when analyzing the three microarray data sets aforemen- 
tioned. In Section 4, we further compare HCT with CVT with simulated 
data. 
1.12. Summary and possible extensions. We propose HCT classifier for
two-class classification, where the major methodological innovation is the 
use of IT for feature selection and the use of HC for threshold choice. 

IT is based on an 'optimal' linear transform that maximizes SNR in all
signal locations, and has advantages over BT and WT. IT also has a
three-fold advantage over the well-known variable selection methods such as the
Lasso, SCAD, and Dantzig selector: (a) IT is computationally faster, (b) IT
is more approachable in terms of delicate analysis, and (c) the tuning
parameter of IT can be conveniently set, but how to set the tuning parameters
of the other methods remains an open problem.

The idea of using HC for threshold choice goes back to [15], where the
focus is on the case where $\Omega$ is known and is the identity matrix (see also
[24]). In this paper, with considerable effort, we extend the idea to the case
where $\Omega$ is unknown but is presumably sparse, and show that HC achieves
the optimal phase diagram in classification. The optimality of HC is not
coincidental, and the underlying reason is the intimate relationship between
the HC functional and Fisher's separation. This is explained in detail in
Sections 2-3.

In Theorems 1.2-1.3 and Sections 2-3, we assume the signals have the same
signs and strengths. The first assumption is largely for simplicity and can be
removed. The second assumption can be largely relaxed, and both Theorems
1.2-1.3 and the intimate relationship between HC and Fisher's separation
continue to hold to some extent if the signal strengths are unequal. One
such example is where the signal distribution $H_p$, after being scaled by a factor
of $(\log(p))^{-1/2}$, has a continuous density over a closed interval contained in
$(0, \infty)$ which does not depend on $p$.

In the paper, we also assume $\Omega$ (equivalently, the induced graph 
$G = (V, E)$) is $K$-sparse for a moderately large $K$, which can also be 
relaxed. First, the main results continue to hold if there is an integer 
$M = M_p$ such that (a) $M_p \le L_p$, and (b) $V$ partitions into $M$ 
different subsets, and any pair of nodes in the same subset are not connected 
(but nodes in different subsets could be connected in an arbitrary way). 
Second, when $\Omega$ has many small nonzero coordinates, we can always 
regularize it first with a threshold $t > 0$: 
$\Omega^*(i,j) = \Omega(i,j) 1\{|\Omega(i,j)| \ge t\}$, and the main results 
continue to hold if $\Omega^*$ is $K$-sparse and the difference between the 
two matrices is 'sufficiently small'. 

1.13. Content. The remaining part of the paper is organized as follows. 
In Section 2, we introduce two functionals: Fisher's separation and ideal HC, 
and show that the two functionals are intimately connected to each other. In 
Section 3, we derive a large-deviation bound on the empirical cdf, and then 
use it to characterize the stochastic fluctuation of the HC functional and 
that of Fisher's separation. Theorems 1.2-1.3 are proved at the end of that 
section. Section 4 contains numerical examples. Section 5 is the proof section, 
with proofs for secondary lemmas left to the appendix. 

1.14. Notations. In this paper, $C > 0$ and $L_p > 0$ denote a generic 
constant and a generic multi-$\log(p)$ term respectively, which may vary from 
occurrence to occurrence. For two positive sequences $\{a_p\}_{p=1}^{\infty}$ 
and $\{b_p\}_{p=1}^{\infty}$, we say $a_p \sim b_p$ if 
$\lim_{p \to \infty} \{a_p / b_p\} = 1$ and we say $a_p \asymp b_p$ if there 
is a constant $c_0 > 1$ such that for sufficiently large $p$, 
$c_0^{-1} \le a_p / b_p \le c_0$. 

The notations $\Omega$ and $\Sigma$ are always associated with each other by 
$\Omega = \Sigma^{-1}$, and $(X_i, Y_i)$ represents a training sample while 
$(X, Y)$ represents a test sample. The summarizing z-vector for the training 
data set is denoted by $Z$, with $\tilde{Z} = \Omega Z$ and 
$\hat{Z} = \hat{\Omega} Z$, where $\hat{\Omega}$ is some estimate of $\Omega$. 

2. Ideal threshold and ideal HCT. In Sections 2-3, we discuss the 
behavior of the HCT classifier. We limit our discussion to the 
ARW$(\beta, r, \theta, \Omega)$ model, but the key ideas are valid beyond the 
ARW model and extensions are possible; see discussions in Section 1.10. 

The key insight behind the HCT methodology is that in a broad context, 

HCT $\approx$ ideal HCT $\approx$ ideal threshold. 

The ideal HCT is the non-stochastic counterpart of HCT, and the ideal 
threshold is the threshold one would choose if the underlying signal structure 
were known. 

In this section, we elaborate on the intimate connection between the ideal 
HCT and the ideal threshold, and their connections to Fisher's separation. 
We also investigate the performance of the 'ideal classifier', where we assume 
$\Omega$ is known and the threshold is set ideally. 

The connection between HCT and ideal HCT is addressed in Section 3, 
which is new even in the case of $\Omega = I_p$; compare [16]. Theorems 
1.2-1.3 are also proved in Section 3. 

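2.1. Fisher's separation and classification heuristics. Fix a threshold 
$t > 0$ and let $\hat{\Omega}$ be an acceptable estimator of $\Omega$. We are 
interested in the classifier that estimates $Y = \pm 1$ according to 
$L_t(X, \hat{\Omega}) \gtrless 0$, where as in (1.12)-(1.13), 

$L_t(X, \hat{\Omega}) = (\hat{\mu}_t^{\hat{\Omega}})' \hat{\Omega} X$ with $\hat{\mu}_t^{\hat{\Omega}}(j) = \mathrm{sgn}(\hat{Z}(j)) 1\{|\hat{Z}(j)| \ge t\}$. 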
For any fixed $p \times 1$ vector $Z$ and $p \times p$ positive definite 
matrix $A$, we introduce 

$M_p(t, Z, \mu, A) = M_p(t, Z, \mu, A; \Omega, p) = (\hat{\mu}_t^{A})' A \mu$, 

and 

$V_p(t, Z, A) = V_p(t, Z, A; \Omega) = (\hat{\mu}_t^{A})' A \Omega^{-1} A \hat{\mu}_t^{A}$, 

where loosely, 'M' and 'V' stand for the mean and variance, respectively. 
In our model, given $(\mu, Z, \hat{\Omega})$, the test sample 
$X \sim N(Y \cdot \mu, \Omega^{-1})$; see (1.2) and note that $\hat{\Omega}$ 
is independent of $X$ since it is acceptable. It follows that 

$L_t(X, \hat{\Omega}) \sim N(Y \cdot M_p(t, Z, \mu, \hat{\Omega}), V_p(t, Z, \hat{\Omega}))$, 

and the misclassification error rate of $L_t(X, \hat{\Omega})$ is 

(2.1)  $P(Y \cdot L_t(X, \hat{\Omega}) < 0 \mid \mu, Z, \hat{\Omega}) = \bar{\Phi}\left( M_p(t, Z, \mu, \hat{\Omega}) / \sqrt{V_p(t, Z, \hat{\Omega})} \right)$, 

where $\bar{\Phi} = 1 - \Phi$ denotes the survival function of $N(0, 1)$. 

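The right hand side of (2.1) is closely related to the well-known Fisher's 
separation (Sep) [1], which measures the standardized interclass distance 
$\mathrm{Sep}(t, Z, \mu, \hat{\Omega}) = \mathrm{Sep}(t, Z, \mu, \hat{\Omega}; \Omega, p)$: 

(2.2)  $\mathrm{Sep}(t, Z, \mu, \hat{\Omega}; \Omega, p) = \left( E[L_t(X, \hat{\Omega}) \mid Y = 1] - E[L_t(X, \hat{\Omega}) \mid Y = -1] \right) / \mathrm{SD}(L_t(X, \hat{\Omega}))$. 

In fact, it is seen that 
$\mathrm{Sep}(t, Z, \mu, \hat{\Omega}) = 2 M_p(t, Z, \mu, \hat{\Omega}) / \sqrt{V_p(t, Z, \hat{\Omega})}$, 
and (2.1) can be rewritten as 

$P(Y \cdot L_t(X, \hat{\Omega}) < 0 \mid \mu, Z, \hat{\Omega}) = \bar{\Phi}\left( \tfrac{1}{2} \mathrm{Sep}(t, Z, \mu, \hat{\Omega}) \right)$. 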
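
To make the link between the thresholded classifier, Fisher's separation and the misclassification error concrete, the following Python sketch evaluates the mean term, the variance term, the separation (twice the mean over the standard deviation) and the resulting error for a given threshold. It is only an illustration under simplifying assumptions: matrices are plain nested lists, the matrix applied to the data is assumed symmetric, the noise covariance (the inverse of the precision matrix) is passed in explicitly, and the function names are hypothetical rather than taken from the paper.

```python
import math


def normal_survival(x):
    """Survival function of N(0, 1), i.e. 1 - Phi(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def mat_vec(mat, vec):
    """Matrix-vector product for a nested-list matrix."""
    return [sum(a * v for a, v in zip(row, vec)) for row in mat]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def selected_signs(z_hat, t):
    """Sign of each coordinate whose magnitude reaches t, and 0 elsewhere."""
    return [((z > 0) - (z < 0)) if abs(z) >= t else 0 for z in z_hat]


def fisher_separation(t, z, mu, a_mat, sigma):
    """Return 2 * M / sqrt(V) for threshold t.

    a_mat plays the role of the (estimated) precision matrix and is assumed
    symmetric; sigma is the true noise covariance (an assumption of this sketch).
    """
    mu_hat = selected_signs(mat_vec(a_mat, z), t)
    w = mat_vec(a_mat, mu_hat)        # A mu_hat
    m = dot(w, mu)                    # mean term: mu_hat' A mu
    v = dot(w, mat_vec(sigma, w))     # variance term: mu_hat' A Sigma A mu_hat
    return 2.0 * m / math.sqrt(v) if v > 0 else 0.0


def misclassification_error(sep):
    """Error rate implied by a separation value: survival function at sep / 2."""
    return normal_survival(0.5 * sep)
```

For instance, with identity matrices, `z = [3.0, 0.5, -4.0]`, `mu = [1.0, 0.0, -1.0]` and `t = 2.0`, the two large coordinates are retained with the correct signs, and the separation equals 4 divided by the square root of 2.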



By (1.14) and (1.18), the overall misclassification error rate is then 

(2.3)  $P(Y \cdot L_t(X, \hat{\Omega}) < 0) = E_{\epsilon_p, \tau_p} E\left[ \bar{\Phi}\left( \tfrac{1}{2} \mathrm{Sep}(t, Z, \mu, \hat{\Omega}) \right) \right]$, 

where $E$ is the expectation with respect to the law of 
$(Z, \hat{\Omega} \mid \mu)$, and $E_{\epsilon_p, \tau_p}$ is the expectation 
with respect to the law of $\mu$; see (1.14) and (1.18). 

We introduce two proxies for Fisher's separation. Throughout this paper, 

(2.4)  $\tilde{Z} = \Omega Z$. 
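For the first proxy, recall that $\hat{Z} = \hat{\Omega} Z$ (e.g., (1.9)). 
Heuristically, $\hat{\Omega} \approx \Omega$ and so $\hat{Z} \approx \tilde{Z}$. 
We expect that 
$\mathrm{Sep}(t, Z, \mu, \hat{\Omega}) \approx \mathrm{Sep}(t, Z, \mu, \Omega)$; 
the latter is Fisher's separation for the idealized case where $\Omega$ is 
known and is defined as 

(2.5)  $\mathrm{Sep}(t, Z, \mu, \Omega) = 2 M_p(t, Z, \mu, \Omega) / \sqrt{V_p(t, Z, \Omega)}$. 

For the second proxy, we note that when $p$ is large, some regularity 
appears, and we expect that 
$M_p(t, Z, \mu, \Omega) \approx m_p(t, \epsilon_p, \tau_p, \Omega)$ and 
$V_p(t, Z, \Omega) \approx v_p(t, \epsilon_p, \tau_p, \Omega)$, where 

(2.6)  $m_p(t, \epsilon_p, \tau_p, \Omega) = E[M_p(t, Z, \mu, \Omega)]$,  $v_p(t, \epsilon_p, \tau_p, \Omega) = E[V_p(t, Z, \Omega)]$. 

In light of this, a second proxy separation is the population Sep: 

$\widetilde{\mathrm{Sep}}(t) = \widetilde{\mathrm{Sep}}(t, \epsilon_p, \tau_p, \Omega) = 2 m_p(t, \epsilon_p, \tau_p, \Omega) / \sqrt{v_p(t, \epsilon_p, \tau_p, \Omega)}$. 

In summary, we expect to see that 

$\mathrm{Sep}(t, Z, \mu, \hat{\Omega}) \approx \mathrm{Sep}(t, Z, \mu, \Omega) \approx \widetilde{\mathrm{Sep}}(t, \epsilon_p, \tau_p, \Omega)$, 

and that 

(2.7)  $P(Y \cdot L_t(X, \hat{\Omega}) < 0) \approx \bar{\Phi}\left( \tfrac{1}{2} \widetilde{\mathrm{Sep}}(t) \right)$. 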

In Section 3, we solidify the above connections. But before we do that, we 
study the ideal threshold, that is, the threshold that maximizes 
$\widetilde{\mathrm{Sep}}(t)$. 

2.2. Ideal threshold. Ideally, one would choose $t$ to minimize the 
classification error of $L_t(X, \hat{\Omega})$. In light of (2.7), this is 
almost equivalent to choosing $t$ as the ideal threshold. 

Definition 2.1. The ideal threshold $T_{ideal}(\epsilon_p, \tau_p, \Omega)$ 
is the maximizing point of the second proxy: 
$T_{ideal}(\epsilon_p, \tau_p, \Omega) = \mathrm{argmax}_{\{0 < t < \infty\}} \widetilde{\mathrm{Sep}}(t, \epsilon_p, \tau_p, \Omega)$. 

In general, $\widetilde{\mathrm{Sep}}(t, \epsilon_p, \tau_p, \Omega)$ and 
$T_{ideal}(\epsilon_p, \tau_p, \Omega)$ may depend on $\Omega$ in a 
complicated way. Fortunately, it turns out that for large $p$ and all 
$\Omega$ in $\mathcal{M}_p^*(a, K_p)$ (see (1.19)), the leading terms of 
$\widetilde{\mathrm{Sep}}(t)$ and $T_{ideal}(\epsilon_p, \tau_p, \Omega)$ do 
not depend on the off-diagonals of $\Omega$ and have rather simple forms. 

Definition 2.2. (Folding). Denote $\Psi_\tau(t) = P(|N(\tau, 1)| \le t)$. 
When $\tau = 0$, we drop the subscript and write $\Psi(t)$. Also, denote 
$\bar{\Psi}_\tau(t) = 1 - \Psi_\tau(t)$ and $\bar{\Psi}(t) = 1 - \Psi(t)$. 
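In detail, let 

(2.8)  $W_0(t) = W_0(t, \epsilon_p, \tau_p) = \epsilon_p \bar{\Psi}_{\tau_p}(t) / \sqrt{\bar{\Psi}(t) + \epsilon_p \bar{\Psi}_{\tau_p}(t)}$, 

(2.9)  $t_p^*(\beta, r) = \min\{2, (\beta + r)/(2r)\} \, \tau_p$, 

and 

(2.10)  $\delta(\beta, r) = \beta - r$ if $r \le \beta/3$;  $\delta(\beta, r) = (\beta + r)^2/(8r)$ if $\beta/3 < r < \beta$;  $\delta(\beta, r) = \beta/2$ if $\beta \le r < 1$. 

Elementary calculus shows that for large $p$, 

(2.11)  $\mathrm{argmax}_{\{0 < t < \infty\}} \{W_0(t)\} \sim t_p^*(\beta, r)$,  $\sup_{\{0 < t < \infty\}} W_0(t) = L_p \cdot p^{-\delta(\beta, r)}$. 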
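
As a rough numerical companion to the folding survival function, the function W_0, the threshold in (2.9) and the exponent in (2.10), the Python sketch below codes each of them directly. It assumes the usual rare/weak calibration in which the signal strength is the square root of 2 r log(p), and all function names are hypothetical; it is meant for exploring the formulas at finite p, not as a statement about how closely the asymptotic claims hold for any particular p.

```python
import math


def normal_survival(x):
    """Survival function of N(0, 1)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def folded_survival(t, tau=0.0):
    """P(|N(tau, 1)| >= t) for t >= 0 (the folding survival function)."""
    return normal_survival(t - tau) + normal_survival(t + tau)


def w0(t, eps, tau):
    """W_0(t): signal part of the survival function over the root of the total."""
    signal = eps * folded_survival(t, tau)
    return signal / math.sqrt(folded_survival(t) + signal)


def signal_strength(r, p):
    """tau_p = sqrt(2 r log p) (assumed calibration of the signal strength)."""
    return math.sqrt(2.0 * r * math.log(p))


def t_star(beta, r, p):
    """Leading term of the maximizer of W_0, as in (2.9)."""
    return min(2.0, (beta + r) / (2.0 * r)) * signal_strength(r, p)


def delta(beta, r):
    """Exponent in (2.10): sup W_0 behaves like p^(-delta) up to log terms."""
    if r <= beta / 3.0:
        return beta - r
    if r < beta:
        return (beta + r) ** 2 / (8.0 * r)
    return beta / 2.0
```

One way to use the sketch is to evaluate `w0` on a grid of thresholds for a large `p` with `eps = p ** (-beta)` and compare the location of the grid maximum with `t_star(beta, r, p)`; agreement is only expected up to lower-order terms.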

It turns out that there is an intimate relationship between 
$\widetilde{\mathrm{Sep}}(t, \epsilon_p, \tau_p, \Omega)$ and 
$W_0(t, \epsilon_p, \tau_p)$, where the latter does not depend on the 
off-diagonals of $\Omega$. To see the point, we discuss the cases $r < \beta$ 
and $r > \beta$ separately. In the first case, for $a$ as in 
$\mathcal{M}_p^*(a, K_p)$, we let 

(2.12) co{p,r,a) = 6{f3,r)-6{p,a^r), co(/3, r, a) = ci(/3, r, a) - 5(/3, r), 
where if a < 1/3, ci{/3,r,a) = /3, and otherwise, 

f (3Q-l)r- n ^ _3-Q_^ 



1+a 8r ' l+5a 

The following lemma is proved in Section 5. 



/3 <r</3. 



Lemma 2.1. Fix $(\beta, r, \theta, a) \in (0, 1)^4$ such that 
$\rho_\theta^*(\beta) < r < \beta$ and $(1 - \theta)/2 < \beta < (1 - \theta)$. 
In the ARW$(\beta, r, \theta, \Omega)$ model, as $p \to \infty$, 

sup sup \p 2 5ep(t,ep,rp,17) - 2TpWo(t,ep,Tp)| < Lpp "'^^^p 2- 4 t 
t>o {QeXjCa.i^-p)} 

+ Lp [p' mm{r,i^,(l-a)(/3-ar)} ^ ^-co(/3,r,a) ^ ^-ci(/3,r,a) j ^^^^^ 

{0<t<oo} 
Compared to the left hand side, the right hand side is much smaller and 
is negligible. Therefore, approximately, 
$\widetilde{\mathrm{Sep}}(t, \epsilon_p, \tau_p, \Omega) \propto W_0(t, \epsilon_p, \tau_p)$ 
for all $\Omega \in \mathcal{M}_p^*(a, K_p)$. Combining this with (2.11), we 
expect to have 

(2.13)  $T_{ideal}(\epsilon_p, \tau_p, \Omega) \sim t_p^*(\beta, r)$,  $\sup_{0 < t < \infty} \widetilde{\mathrm{Sep}}(t, \epsilon_p, \tau_p, \Omega) = L_p \, p^{(1 - \theta)/2 - \delta(\beta, r)}$. 

Next, consider the case $r > \beta$. The lemma below is proved in Section 5. 

Lemma 2.2. Fix {^,r,9,a) G (0,1)^ such that r > /3 and {1 - 6)/2 < 
P < (1-9). Let Ai = dolog(log(;3))/^/fo^ and A2 = 2^1og(iip logp), 
where do > is some constant. In the ARW{I3, r, 9, il.) model with Q G 
Aip{a, 6, Kp), as p ^ 00, 

(b) sup{,>,^+^^}5ep(t,ep,Tp,f]) < IrpK-^p^'-^y^ , 

(c) 2TpK-^p—^ ^sup|y2?T^_^^<t<^^}5ep(t,ep,Tp,rj) < Lpp-^ . 

A direct result of Lemma 2.2 is that, for all 
$\Omega \in \mathcal{M}_p^*(a, b, K_p)$ (see (1.19)), 

(2.14)  $\sqrt{2\beta \log(p)} \le T_{ideal} \le \sqrt{2r \log(p)}$,  $\sup_{\{0 < t < \infty\}} \{\widetilde{\mathrm{Sep}}(t)\} \asymp L_p \, p^{(1 - \theta - \beta)/2}$, 

where $T_{ideal} = T_{ideal}(\epsilon_p, \tau_p, \Omega)$ and 
$\widetilde{\mathrm{Sep}}(t) = \widetilde{\mathrm{Sep}}(t, \epsilon_p, \tau_p, \Omega)$ 
for short. In this case, the function $\widetilde{\mathrm{Sep}}(t)$ sharply 
increases and decreases in the intervals $(0, \sqrt{2\beta \log(p)})$ and 
$(\sqrt{2r \log(p)}, \infty)$, respectively, but is relatively flat in the 
interval $(\sqrt{2\beta \log(p)}, \sqrt{2r \log(p)})$; in this interval, the 
function reaches the maximum but varies slowly at the magnitude of 
$O(L_p \, p^{(1 - \theta - \beta)/2})$. In the current case, on one hand, it 
is not critical to pin down $T_{ideal}$: 
$\widetilde{\mathrm{Sep}}(t) = L_p \, p^{(1 - \theta - \beta)/2}$ for all $t$ 
in the whole interval. On the other hand, it is hard to pin down $T_{ideal}$ 
uniformly for all $\Omega$ under consideration, if possible at all. 

2.3. Ideal HCT. Ideal HCT is a counterpart of HCT and a non-stochastic 
threshold that HCT tries to estimate. Introduce a functional which is defined 
over all survival functions associated with a positive random variable: 

$HC(t, G) = \sqrt{p} \, [G(t) - \bar{\Psi}(t)] / \sqrt{G(t)(1 - G(t))}$,  $t > 0$. 

We are primarily interested in thresholds that are neither too small nor too 
large as far as HCT is concerned; see (1.10). In light of this, we introduce 
the HCT functional 

$T_{HC}(G) = \mathrm{argmax}_{\{\bar{\Psi}^{-1}(1/2) < t < s_p^*\}} HC(t, G)$, 
where the term $\bar{\Psi}^{-1}(1/2)$ is chosen for convenience, and can be 
replaced by some other positive constants. Recall that $\tilde{Z} = \Omega Z$ 
and $\hat{Z} = \hat{\Omega} Z$ (e.g., (2.4) and (1.9)). For any $t > 0$, let 

(2.15)  $\hat{F}_p(t) = \frac{1}{p} \sum_{j=1}^{p} 1\{|\hat{Z}(j)| \ge t\}$, 

and 

(2.16)  $F_p(t) = \frac{1}{p} \sum_{j=1}^{p} 1\{|\tilde{Z}(j)| \ge t\}$,  $F(t) = F(t, \epsilon_p, \tau_p, \Omega) = E_{\epsilon_p, \tau_p}[F_p(t)]$. 

Note that the only difference between $F_p(t)$ and $F(t)$ is the subscript $p$. 

Heuristically, for large $p$, we expect to have 
$\hat{F}_p(t) \approx F_p(t) \approx F(t)$. As a result, we expect that 

$T_{HC}(\hat{F}_p) \approx T_{HC}(F_p) \approx T_{HC}(F)$, 

where $T_{HC}(\hat{F}_p)$ is the HCT where $\Omega$ is unknown and has to be 
estimated, $T_{HC}(F_p)$ is the HCT when $\Omega$ is known, and $T_{HC}(F)$ 
is a non-stochastic counterpart of $T_{HC}(F_p)$. 

Definition 2.3. We call $T_{HC}(F)$ the ideal Higher Criticism Threshold 
(ideal HCT). 
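
The empirical side of this construction can be sketched in a few lines of Python: given a vector of z-scores, evaluate the HC functional at each observed magnitude and return the maximizer over the admissible range, from the median of |N(0, 1)| up to the square root of 2 log(p). This is a minimal sketch under stated assumptions: the candidate thresholds are taken to be the observed magnitudes, ties are ignored, and `hc_threshold` is a hypothetical helper name rather than the paper's procedure in (1.10)-(1.11).

```python
import math


def folded_normal_survival(t):
    """P(|N(0, 1)| >= t) for t >= 0."""
    return math.erfc(t / math.sqrt(2.0))


def hc_threshold(z):
    """Return (threshold, HC value) maximizing the empirical HC functional.

    Candidates are the observed |z| values strictly between the median of
    |N(0, 1)| (about 0.6745) and sqrt(2 log p); ties are assumed absent.
    Returns (None, -inf) if no candidate falls in that range.
    """
    p = len(z)
    lower = 0.6744897501960817          # median of |N(0, 1)|
    upper = math.sqrt(2.0 * math.log(p))
    magnitudes = sorted((abs(v) for v in z), reverse=True)
    best_t, best_hc = None, float("-inf")
    for k, t in enumerate(magnitudes, start=1):
        if not (lower < t < upper):
            continue
        frac = k / p                    # fraction of coordinates with |z| >= t
        if frac >= 1.0:
            continue                    # HC is undefined when the fraction is 1
        hc = math.sqrt(p) * (frac - folded_normal_survival(t)) / math.sqrt(frac * (1.0 - frac))
        if hc > best_hc:
            best_t, best_hc = t, hc
    return best_t, best_hc
```

With 90 small magnitudes and 10 magnitudes clustered just above 2.5, for example, the maximizer sits at the lower edge of the cluster, which is the threshold that retains exactly the 10 large coordinates.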

Similarly, the leading term of ideal HCT has a simple form that is easy 
to analyze. Fix $1 \le j \le p$. Let 
$D_j = \{k : 1 \le k \le p, \Omega(j, k) \ne 0\}$, and let 

$g_1(t) = g_1(t; \Omega, \epsilon_p, \tau_p) = \frac{1}{p} \sum_{j=1}^{p} P(|\tilde{Z}(j)| \ge t, \ \mu(k) \ne 0 \text{ for some } k \in D_j, k \ne j)$. 

The following is a counterpart of $W_0(t)$ defined in (2.8) and can be well 
approximated by the latter: 

(2.17) Wo(i) = Wo(t,ep,Tp,iZ) ^ 



^^'(t) + ep^',^(t) + <7i(t) 

The following lemmas are proved in Section 5. 

Lemma 2.3. Fix $(\beta, r, \theta, a) \in (0, 1)^4$ such that 
$r > \rho_\theta^*(\beta)$ and $(1 - \theta)/2 < \beta < (1 - \theta)$. In the 
ARW$(\beta, r, \theta, \Omega)$ model, as $p \to \infty$, 

sup sup {\p~^/^HC{t,F)-Wo{t,ep,Tp,n)\}<Lpp-f^. 
Lemma 2.4. Fix $(\beta, r, \theta, a) \in (0, 1)^4$ such that 
$r > \rho_\theta^*(\beta)$ and $(1 - \theta)/2 < \beta < (1 - \theta)$. In the 
ARW$(\beta, r, \theta, \Omega)$ model, as $p \to \infty$, we have 

sup sup \Wo{t,ep,Tp,n)-Woit,ep,Tp)\<Lp[p^^+p-''°^'^^''^''^supWo 

{t>0} {Q(^M*{a,Kp)} {t>0} 

If additionally $r > \beta$, then 

(a) snp^^^^^^y^jj^^_^^^Wo{t,ep,Tp,n) < 

(b) sup{,^^t^^yWoit,ep,Tp,n) < {^)p-^'\ 

(c) p-^l^ < sup^^;^^„^^<,<^^^ Wo{t, ep, Tp, n) < Lpp-PI\ 

where $\Delta_1 = d_0 \log\log(p) / \sqrt{\log(p)}$ is defined in Lemma 2.2. 

Lemmas 2.3-2.4 say that, approximately, $HC(t, F) \propto \tilde{W}_0(t)$, 
and that the two functions $\tilde{W}_0(t)$ and $W_0(t)$ are generally close. 

Together, Lemmas 2.1-2.4 consolidate the intimate relationship between 
the ideal threshold and the ideal HCT. To see the point, we discuss the cases 
$r < \beta$ and $r > \beta$ separately. 

For the first case, write $T_{ideal} = T_{ideal}(\epsilon_p, \tau_p, \Omega)$ 
and $\widetilde{\mathrm{Sep}}(t) = \widetilde{\mathrm{Sep}}(t, \epsilon_p, \tau_p, \Omega)$ 
for short as before. The following theorem is proved in Section 5. 

Theorem 2.1. Fix $(\beta, r, \theta, a) \in (0, 1)^4$ such that 
$\rho_\theta^*(\beta) < r < \beta$ and $(1 - \theta)/2 < \beta < (1 - \theta)$. 
In the ARW$(\beta, r, \theta, \Omega)$ model with 
$\Omega \in \mathcal{M}_p^*(a, K_p)$, as $p \to \infty$, there is a constant 
$c_1 = c_1(\beta, r, a) > 0$ such that 
$|T_{HC}(F) - T_{ideal}| \le L_p \, p^{-c_1(\beta, r, a)}$, and so 
$\widetilde{\mathrm{Sep}}(T_{HC}(F)) \sim \widetilde{\mathrm{Sep}}(T_{ideal}) = L_p \, p^{(1 - \theta)/2 - \delta(\beta, r)}$. 

Consider the second case. Lemma 2.4 says that 
$\sqrt{2\beta \log(p)} \le T_{HC}(F) \le \sqrt{2r \log(p)}$. While it is hard 
to further elaborate how close the two ideal thresholds are, in light of 
(2.14), classification by ideal HCT is at least "sub-optimal". The following 
theorem is proved in Section 5. 

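Theorem 2.2. Fix $(\beta, r, \theta, a) \in (0, 1)^4$ such that $r > \beta$ 
and $(1 - \theta)/2 < \beta < (1 - \theta)$. In the 
ARW$(\beta, r, \theta, \Omega)$ model where 
$\Omega \in \mathcal{M}_p^*(a, b, K_p)$, as $p \to \infty$, we have that 
$2 \tau_p K_p^{-1} p^{(1 - \theta - \beta)/2} \le \widetilde{\mathrm{Sep}}(T_{HC}(F)) \le \widetilde{\mathrm{Sep}}(T_{ideal}(\epsilon_p, \tau_p, \Omega)) = L_p \, p^{(1 - \theta - \beta)/2}$. 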

To conclude this section, we investigate the 'ideal' classifier 
$L_t(X, \Omega)$, where $\Omega$ is known to us. Note that for each fixed $t$, 
the misclassification error of $L_t(X, \Omega)$ is 
$P(Y \cdot L_t(X, \Omega) < 0) = E_{\epsilon_p, \tau_p} E\left[ \bar{\Phi}\left( \tfrac{1}{2} \mathrm{Sep}(t, Z, \mu, \Omega) \right) \right]$. 
The following theorem is proved in Section 5. 
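Theorem 2.3. Fix $(\beta, r, \theta, a) \in (0, 1)^4$ such that 
$(1 - \theta)/2 < \beta < (1 - \theta)$ and $r > \rho_\theta^*(\beta)$. In the 
ARW$(\beta, r, \theta, \Omega)$ model with 
$\Omega \in \mathcal{M}_p^*(a, b, K_p)$, as $p \to \infty$, 

$\min_t P(Y \cdot L_t(X, \Omega) < 0 \mid t) = \bar{\Phi}\left( \tfrac{1}{2}(1 + o(1)) \cdot \widetilde{\mathrm{Sep}}(T_{ideal}) \right)$. 

When $r < \beta$, the condition $\Omega \in \mathcal{M}_p^*(a, b, K_p)$ can 
be relaxed to that of $\Omega \in \mathcal{M}_p^*(a, K_p)$. 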

Combining Theorem 2.3 with Theorems 2.1-2.2, 

$\min_t P(Y \cdot L_t(X, \Omega) < 0 \mid t) = \bar{\Phi}\left( h(t) \cdot \widetilde{\mathrm{Sep}}(T_{HC}(F)) \right)$, 

where $h(t) = h(t; \beta, r, \theta, a, K_p, p)$ satisfies 
$h(t) = 1/2 + o(1)$ when $r < \beta$ and $h(t) = L_p$ when $r > \beta$. Recall 
that in both cases, 
$\widetilde{\mathrm{Sep}}(T_{ideal}) = L_p \widetilde{\mathrm{Sep}}(T_{HC}(F)) = L_p \, p^{(1 - \theta)/2 - \delta(\beta, r)}$, 
where the exponent $(1 - \theta)/2 - \delta(\beta, r)$ is strictly positive 
by the assumption of $r > \rho_\theta^*(\beta)$. Therefore, if $(\beta, r)$ 
falls in the Region of Possibility and if we set $t$ as either of the two 
ideal thresholds, then $L_t(X, \Omega)$ not only gives successful 
classification, but the classification error converges to 0 very fast. 

3. Classification by HCT. In the preceding section, we focused on two 
ideal thresholds. In this section, we study the empirical quantities, and 
characterize the stochastic fluctuation of HCT and Sep defined in (2.2). We 
conclude the section by proving Theorems 1.2-1.3. The main results in this 
section are new, even in the idealized case where $\Omega = I_p$. 

3.1. Stochastic control on the HC functional. Recall that 

$HC(t, \hat{F}_p) = \sqrt{p} \, [\hat{F}_p(t) - \bar{\Psi}(t)] / \sqrt{\hat{F}_p(t)(1 - \hat{F}_p(t))}$. 

When $\hat{F}_p(t) = 0$, the above is not well-defined, and we modify the 
definition slightly by replacing $\hat{F}_p(t)$ with $1/p$. The change does 
not affect the proof of the results. The stochastic fluctuation of HCT comes 
from that of $\hat{F}_p(t)$, which consists of two components: that of 
estimating $\Omega$ and that of the data. This is captured in the following 
triangle inequality (see (2.15)-(2.16)): 

$|\hat{F}_p(t) - F(t)| \le |F_p(t) - F(t)| + |\hat{F}_p(t) - F_p(t)|$. 

Consider $|F_p(t) - F(t)|$ first. The key is to study 

$\sqrt{p} \, (F_p(t) - F(t)) / \sqrt{F(t)(1 - F(t))}$. 
When $\Omega = I_p$, this is the standard uniform stochastic process [37] 
and much is known about its stochastic fluctuation. In the more general case 
where $\Omega \ne I_p$, it is usually hard to derive a tight bound on the tail 
probability of this process. Fortunately, when $\Omega$ is $K_p$-sparse, tight 
bounds are possible, and the key is the separability of sparse graphs 
introduced in Lemma 1.1. 

Recall that $s_p^* = \sqrt{2 \log(p)}$ (e.g., (1.10)). The following lemma is 
the direct result of Lemma 1.1 and the well-known Bennett's inequality [37], 
and is proved in Section 5. 

Lemma 3.1. Fix $(\beta, r, \theta, a) \in (0, 1)^4$ and consider an 
ARW$(\beta, r, \theta, \Omega)$ model with 
$\Omega \in \mathcal{M}_p^*(a, K_p)$. As $p \to \infty$, there is a constant 
$C > 0$ such that 
with probability at least 1 — o{p~^), for all t satisfying ^"-"^(1/2) < t < s* 



y^\Fp{t) - F{t)\/^jFm - Fit)) < CK^pilogip)) 
Next, consider $|\hat{F}_p(t) - F_p(t)|$. Recall that $n_p = p^\theta$. By 
definition, if $\hat{\Omega}$ is an acceptable estimator of $\Omega$, then 
there is a constant $C > 0$ such that with 
probability at least 1 — o(p~^), 

(3.1)  $\max_{\{1 \le i, j \le p\}} \{|\hat{\Omega}(i, j) - \Omega(i, j)|\} \le C K_p \sqrt{2 \log(p)} \cdot p^{-\theta/2}$. 

As a result, we have the following lemma, whose proof is straightforward 
and thus omitted. Recall that $\tilde{Z} = \Omega Z$ and 
$\hat{Z} = \hat{\Omega} Z$ (e.g., (2.4) and (1.9)). 

Lemma 3.2. For any acceptable estimator $\hat{\Omega}$, 
$\max_{\{1 \le j \le p\}} \{|\hat{Z}(j) - \tilde{Z}(j)|\} \le C K_p \log(p) \, p^{-\theta/2}$ 
with probability at least $1 - o(1/p)$. 

Write for short $\eta_p = C K_p \log(p) \, p^{-\theta/2}$. By Lemma 3.2, with 
probability at least $1 - o(1/p)$, for all $1 \le j \le p$, 
$|1\{|\hat{Z}(j)| \ge t\} - 1\{|\tilde{Z}(j)| \ge t\}| \le 1\{t - \eta_p \le |\tilde{Z}(j)| < t + \eta_p\}$. 
As a result, 

$|\hat{F}_p(t) - F_p(t)| \le F_p(t - \eta_p) - F_p(t + \eta_p)$, 

where we note that heuristically, 

$F_p(t - \eta_p) - F_p(t + \eta_p) \approx F(t - \eta_p) - F(t + \eta_p) \approx 2 \eta_p |F'(t)|$. 

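Combining these, with probability at least $1 - o(1/p)$, for any 
$t > \bar{\Psi}^{-1}(1/2)$, 

$\sqrt{p} \, |\hat{F}_p(t) - F_p(t)| / \sqrt{F(t)(1 - F(t))} \lesssim 2 \sqrt{p} \, \eta_p |F'(t)| / \sqrt{F(t)(1 - F(t))} = 2 C K_p \log(p) \, p^{(1 - \theta)/2} |F'(t)| / \sqrt{F(t)(1 - F(t))}$. 

Recall $s_p^* = \sqrt{2 \log(p)}$. The above heuristic is captured in the 
following lemma, which is proved in Section 5. 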
Lemma 3.3. Fix $(\beta, r, \theta, a) \in (0, 1)^4$. In the 
ARW$(\beta, r, \theta, \Omega)$ model with 
$\Omega \in \mathcal{M}_p^*(a, K_p)$, there exists a constant $C > 0$ such 
that with probability at least $1 - o(1/p)$, for all $t$ such that 
$\bar{\Psi}^{-1}(1/2) < t < s_p^*$, 

$\sqrt{p} \, |\hat{F}_p(t) - F_p(t)| \cdot [F(t)(1 - F(t))]^{-1/2} \le L_p \max\{(p^{1 - \theta} F(t))^{1/2}, 1\}$. 

Combining Lemmas 3.1 and 3.3, the following theorem follows directly. 

Theorem 3.1. Fix $(\beta, r, \theta, a) \in (0, 1)^4$. In the 
ARW$(\beta, r, \theta, \Omega)$ model with 
$\Omega \in \mathcal{M}_p^*(a, K_p)$, as $p \to \infty$, with probability at 
least $1 - o(p^{-1})$, 

$|HC(t, \hat{F}_p) - HC(t, F)| \le L_p [(p^{1 - \theta} F(t))^{1/2} + 1]$,  $\forall \ \bar{\Psi}^{-1}(1/2) < t < s_p^*$. 

By Theorem 3.1, in order for $|T_{HC}(\hat{F}_p) - T_{HC}(F)|$ to be small, 
we must have that for all $t$ in the vicinity of $T_{HC}(F)$, 

$L_p [(p^{1 - \theta} F(t))^{1/2} + 1] \ll HC(t, F)$. 

When $\theta > 1/2$, this holds for all $(\beta, r)$ in the Region of 
Possibility. When $\theta < 1/2$, this might not hold for all $(\beta, r)$ in 
this region, as the estimation error of $\hat{\Omega}$ is simply too large. 
This explains why we need to restrict HCT to be no less than $s_{p,n}^*$ as in 
(1.10). This also explains why we need Conditions (a)-(b) in Theorem 1.3, but 
we do not need such conditions in Theorem 1.2 and Corollary 1.1. 

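In the ARW$(\beta, r, \theta, \Omega)$ model, $n_p = p^\theta$. Therefore, 

$s_{p,n}^* = s_p(\theta)$, if we let $s_p(\theta) = \sqrt{2 \max\{(1 - 2\theta), 0\} \log(p)}$; 

see (1.10). Accordingly, the HCT defined in (1.11) can be rewritten as 

$t_p^{HC} = T_{HC}(\hat{F}_p)$, if $s_p(\theta) \le T_{HC}(\hat{F}_p) \le s_p^*$, 
$t_p^{HC} = s_p(\theta)$, if $T_{HC}(\hat{F}_p) < s_p(\theta)$, 
$t_p^{HC} = s_p^*$, if $T_{HC}(\hat{F}_p) > s_p^*$. 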
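
The three-case definition above amounts to clipping the raw HC threshold to an interval whose lower end depends on how the sample size grows with the number of features. A minimal Python sketch, assuming the sample size is p raised to the power theta and using hypothetical function names:

```python
import math


def lower_bound(theta, p):
    """s_p(theta) = sqrt(2 * max(1 - 2 * theta, 0) * log p)."""
    return math.sqrt(2.0 * max(1.0 - 2.0 * theta, 0.0) * math.log(p))


def upper_bound(p):
    """s_p^* = sqrt(2 * log p)."""
    return math.sqrt(2.0 * math.log(p))


def clipped_hct(raw_threshold, theta, p):
    """Clip the raw HC threshold to [s_p(theta), s_p^*]."""
    return min(max(raw_threshold, lower_bound(theta, p)), upper_bound(p))
```

When theta is at least 1/2 the lower end is 0, so only the upper clip can be active; for smaller theta the lower end is positive, so thresholds below it are raised to that bound.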

The main result in this section is as follows. 

Theorem 3.2. Fix $(\beta, r, \theta, a) \in (0, 1)^4$ such that 
$(1 - \theta)/2 < \beta < 1 - \theta$ and $r > \rho_\theta^*(\beta)$. In the 
ARW$(\beta, r, \theta, \Omega)$ model with 
$\Omega \in \mathcal{M}_p^*(a, K_p)$, 

1) If $\theta > 1/2$, then as $p \to \infty$, there are positive constants 
$c_2 = c_2(\beta, r, a, \theta)$ and $d_0 = d_0(\beta, r, a, \theta)$ such 
that with probability at least $1 - o(1/p)$, 
$|t_p^{HC} - T_{ideal}(\epsilon_p, \tau_p, \Omega)| \le L_p \, p^{-c_2}$ when 
$r < \beta$, and $t_p^{HC} \in [\sqrt{2\beta \log p} - \Delta_1, \tau_p)$ 
when $r > \beta$, where $\Delta_1 = d_0 \log(\log(p)) / \sqrt{\log(p)}$. 

2) If $0 < \theta < 1/2$ and $(\beta, r, \theta)$ satisfy the conditions in 
Theorem 1.3, then with probability at least $1 - o(1/p)$, 
$|t_p^{HC} - T_{ideal}(\epsilon_p, \tau_p, \Omega)| \le L_p \, p^{-c_3}$ for 
some constant $c_3 = c_3(\beta, r, a) > 0$ when $r < \beta$, and 
$t_p^{HC} \in [\sqrt{2\beta \log p} - \Delta_1, \tau_p)$ for 
$\Delta_1 = d_1 \log(\log(p)) / \sqrt{\log p}$ when $r > \beta$, where 
$d_1 = d_1(\beta, r, a) > 0$ is a constant. 

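3.2. Stochastic fluctuation of Fisher's separation. Similarly, the 
stochastic fluctuation of $\mathrm{Sep}(t, Z, \mu, \hat{\Omega})$ contains two 
parts: that from $\tilde{Z} = \Omega Z$, and that from the estimation of 
$\Omega$. In detail, 

$|\mathrm{Sep}(t, Z, \mu, \hat{\Omega}) - \widetilde{\mathrm{Sep}}(t, \epsilon_p, \tau_p, \Omega)| \le 2 \cdot (I + II)$, 

where 
$I = \tfrac{1}{2} |\mathrm{Sep}(t, Z, \mu, \Omega) - \widetilde{\mathrm{Sep}}(t, \epsilon_p, \tau_p, \Omega)|$ 
and 
$II = \tfrac{1}{2} |\mathrm{Sep}(t, Z, \mu, \hat{\Omega}) - \mathrm{Sep}(t, Z, \mu, \Omega)|$. 

Consider $I$ first. Recall that 

$\mathrm{Sep}(t, Z, \mu, \Omega) = 2 M_p(t, Z, \mu, \Omega) / \sqrt{V_p(t, Z, \Omega)}$. 

Heuristically, 
$M_p(t, Z, \mu, \Omega) = m_p(t, \epsilon_p, \tau_p, \Omega) + O_p(\sqrt{m_p(t, \epsilon_p, \tau_p, \Omega)})$ 
and 
$V_p(t, Z, \Omega) = v_p(t, \epsilon_p, \tau_p, \Omega) + O_p(\sqrt{v_p(t, \epsilon_p, \tau_p, \Omega)})$; 
see (2.6). Combining these with the definitions, we expect that 

(3.2)  $\mathrm{Sep}(t, Z, \mu, \Omega) = \widetilde{\mathrm{Sep}}(t, \epsilon_p, \tau_p, \Omega) \left[ 1 + O_p\left( \frac{1}{\sqrt{m_p(t, \epsilon_p, \tau_p, \Omega)}} + \frac{1}{\sqrt{v_p(t, \epsilon_p, \tau_p, \Omega)}} \right) \right]$, 

where in the bracket, the second term is much smaller than 1. This is 
elaborated in the following lemma which is proved in Section 5. In detail, let 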
q{t) = q{t; /3, r, 9, VLp^p) satisfy that q{t) = p(i-e)/2-max{4/3-2r,3/J+r}/4 if ^ < ^ 
and q{t) = if r > /3. 

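Lemma 3.4. Fix $(\beta, r, \theta, a) \in (0, 1)^4$ such that 
$r > \rho_\theta^*(\beta)$ and $(1 - \theta)/2 < \beta < (1 - \theta)$. In the 
ARW$(\beta, r, \theta, \Omega)$ model with 
$\Omega \in \mathcal{M}_p^*(a, b, K_p)$, as $p \to \infty$, with probability 
at least $1 - o(1/p)$, 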

sup \Sep{t,Z,fj.,^}) - Sep{t,ep,Tp,Q.)\ < Lp[q{t) + p^'^^'^]. 

{t>0} 

When $r < \beta$, the condition on $\Omega$ can be relaxed to that of 
$\Omega \in \mathcal{M}_p^*(a, K_p)$. 

Next, we consider $II$. The following lemma, which is proved in Section 5, 
characterizes the order of $II$. 

Lemma 3.5. Under the same conditions as in Lemma 3.4, as $p \to \infty$, 
with probability at least $1 - o(1/p)$, for all $t$ such that 
$s_p(\theta) \le t \le s_p^*$, 
\Sep{t,Z,fi,n) - Sep{t,Z,fi,n)\ < Lp[p-^(pF(t))V2 + +p-e/2]_ ^,f^^^ 
$r < \beta$, the condition on $\Omega$ can be relaxed to that of 
$\Omega \in \mathcal{M}_p^*(a, K_p)$. 

Combining Lemmas 3.4-3.5, we have the following theorem, which is parallel 
to Theorem 3.1 and is proved in Section 5. 

Theorem 3.3. Under the same conditions as in Lemma 3.4, as $p \to \infty$, 
with probability at least $1 - o(p^{-1})$, for all $t$ such that 
$s_p(\theta) \le t \le s_p^*$, 

$|\mathrm{Sep}(t, Z, \mu, \hat{\Omega}) - \widetilde{\mathrm{Sep}}(t, \epsilon_p, \tau_p, \Omega)| \le L_p [p^{-\theta} (p F(t))^{1/2} + p^{-\theta/2} + q(t)]$. 

When $r < \beta$, the condition on $\Omega$ can be relaxed to that of 
$\Omega \in \mathcal{M}_p^*(a, K_p)$. 

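3.3. Proof of Theorems 1.2-1.3. We are now ready to prove Theorems 
1.2-1.3, where $\Omega$ is assumed as known and unknown, respectively. The 
proofs are similar, so we only show Theorem 1.3. Consider 
$L_{HC}(X, \hat{\Omega})$, where $\hat{\Omega}$ is an acceptable estimator. 
The misclassification error is 

(3.3)  $P(Y \cdot L_{HC}(X, \hat{\Omega}) < 0) = E_{\epsilon_p, \tau_p} E\left[ \bar{\Phi}\left( \tfrac{1}{2} \mathrm{Sep}(t_p^{HC}, Z, \mu, \hat{\Omega}) \right) \right]$. 

We now treat the cases $r < \beta$ and $r > \beta$ separately. 

In the first case, we note that 
$L_p [p^{-\theta} (p F(t))^{1/2} + p^{-\theta/2}] \le L_p \, p^{\min\{0, 1/2 - \theta\}}$ 
for $s_p(\theta) \le t \le s_p^*$. Write 
$T_{ideal} = T_{ideal}(\epsilon_p, \tau_p, \Omega)$ and 
$\widetilde{\mathrm{Sep}}(t) = \widetilde{\mathrm{Sep}}(t, \epsilon_p, \tau_p, \Omega)$ 
for short as before. By Theorem 3.3, with probability $1 - o(1/p)$, 

(3.4)  $|\mathrm{Sep}(t_p^{HC}, Z, \mu, \hat{\Omega}) - \widetilde{\mathrm{Sep}}(t_p^{HC})| \le L_p [p^{\min\{0, 1/2 - \theta\}} + q(t_p^{HC})]$. 

At the same time, by Theorem 3.2, with probability $1 - o(1/p)$, 
$|t_p^{HC} - T_{ideal}|$ is algebraically small. Note that 
$\widetilde{\mathrm{Sep}}(t)$ is a non-stochastic function. By Taylor 
expansion and Lemma 2.1, 

(3.5)  $\widetilde{\mathrm{Sep}}(t_p^{HC}) = (1 + o(1)) \widetilde{\mathrm{Sep}}(T_{ideal}) = L_p \, p^{(1 - \theta)/2 - \delta(\beta, r)}$, 

where $\delta(\beta, r)$ is as in (2.10). By definitions, 
$\max\{4\beta - 2r, 3\beta + r\}/4 > \delta(\beta, r)$. Inserting (3.4)-(3.5) 
into (3.3) gives 

(3.6)  $P(Y \cdot L_{HC}(X, \hat{\Omega}) < 0) = (1 + o(1/p)) \bar{\Phi}\left( L_p \, p^{(1 - \theta)/2 - \delta(\beta, r)} \right) + o(1/p)$, 

and the claim follows since $(1 - \theta)/2 - \delta(\beta, r) > 0$. 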
In the second case, 
$\sqrt{2\beta \log p} \le t_p^{HC} \le \sqrt{2r \log p}$ with probability at 
least $1 - o(1/p)$. Combining this with Theorem 3.3, with probability at least 
$1 - o(1/p)$, 

(3.7) \Sep{t^'',Z,fi,Cl) - S^p{t^^)\ < 

At the same time, by an argument similar to that in the proof of Theorem 2.2, 

$2 \tau_p K_p^{-1} p^{(1 - \theta - \beta)/2} \le \widetilde{\mathrm{Sep}}(t_p^{HC}) \le \widetilde{\mathrm{Sep}}(T_{ideal}) = L_p \, p^{(1 - \theta - \beta)/2}$. 

Combining this with (3.3) and (3.7) gives 

(3.8)  $P(Y \cdot L_{HC}(X, \hat{\Omega}) < 0) = (1 + o(1/p)) \bar{\Phi}\left( L_p \, p^{(1 - \theta)/2 - \delta(\beta, r)} \right) + o(1/p)$, 

and the claim follows since $(1 - \theta)/2 - \delta(\beta, r) > 0$. This 
proves Theorem 1.3. 

We conclude this section with a remark on the convergence rate. At the 
end of Section 2, we show that the 'ideal' classifier $L_t(X, \Omega)$ has a 
very fast convergence rate with $t$ being either the ideal threshold or the 
ideal HCT. In comparison, the convergence rate of $L_{HC}(X, \hat{\Omega})$ 
is unfortunately much slower (but is still algebraically fast). To explain 
this, we note that the rate of convergence of $t_p^{HC}$ to $T_{HC}(F)$ and 
the rate of convergence of $\hat{\Omega}$ to $\Omega$ are both algebraically 
fast; if these convergence rates can be improved, then the misclassification 
error rate of $L_{HC}(X, \hat{\Omega})$ can be improved as well. 

4. Simulations. We have conducted a small-scale numerical study. The 
idea is to select a few sets of representative parameters for experiments, and 
compare the performance of the HCT classifier (HCT) with three other methods: 
ordinary HCT (oHCT), pseudo HCT (pHCT), and CVT. All these methods are very 
similar to HCT, except that (a) in pHCT, we assume $\Omega$ is known to us, 
(b) in CVT, we set the threshold of IT by a 5-fold cross validation, and 
(c) in oHCT, we pretend $\Sigma$ is diagonal, and estimate $\Omega$ 
accordingly. Note that CVT reduces to PAM [39] if we do not utilize the 
correlation structure; see more discussion in [15]. 

4.1. Estimating $\Omega$. For some of the procedures, we need to estimate 
$\Omega$. We use Bickel and Levina's Thresholding (BLT) procedure [4]. 
Alternatively, one could use the glasso [21] or the CLIME [9]. But since the 
main goal is to investigate the performance of HCT, we do not include glasso 
and CLIME in the study: if HCT performs well with $\Omega$ estimated by BLT, 
we expect it to perform even better if $\Omega$ is estimated more accurately. 

At the same time, each of these methods can be improved numerically 
with an additional re-fitting stage. Take the BLT for example. For the 
training data $\{(X_i, Y_i)\}_{i=1}^{n}$, let 
$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} Y_i X_i$ and let 
$\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (Y_i X_i - \bar{X})(Y_i X_i - \bar{X})'$ 
be the empirical covariance matrix. BLT starts by obtaining an estimate of 
$\Sigma$ using thresholding: 

(4.1)  $\hat{\Sigma}^*(i, j) = \hat{\Sigma}(i, j) 1\{|\hat{\Sigma}(i, j)| \ge \eta\}$,  $1 \le i, j \le p$, 

and then estimates $\Omega$ by $\hat{\Omega}^{**} = (\hat{\Sigma}^*)^{-1}$. 
Here, $\eta > 0$ is a tuning parameter. 

We propose the following refitting stage to improve the estimator. Fixing 
a tuning parameter $\zeta > 0$, we further improve $\hat{\Omega}^{**}$ via 
coordinate-wise thresholding and call the resultant estimator 
$\hat{\Omega}^*$: 

(4.2)  $\hat{\Omega}^*(i, j) = \hat{\Omega}^{**}(i, j) 1\{|\hat{\Omega}^{**}(i, j)| \ge \zeta\}$. 

For each $1 \le i \le p$, let 
$S_i = \{1 \le j \le p : \hat{\Omega}^*(i, j) \ne 0\}$, and let $A_i$ be the 
sub-matrix of $\hat{\Sigma}$ formed by restricting the rows/columns of 
$\hat{\Sigma}$ to $S_i$. Denote the final estimate of $\Omega$ by 
$\hat{\Omega} = [\omega_1, \omega_2, \ldots, \omega_p]$. We define $\omega_i$ 
as follows. Write $S_i = \{j_1, j_2, \ldots, j_k\}$, where $k = |S_i|$. Let 
$e_i$ be the $p \times 1$ vector such that $e_i(j) = 1\{i = j\}$, 
$1 \le j \le p$, and let $\tilde{e}_i$ be the $k \times 1$ vector formed by 
restricting the rows of $e_i$ to $S_i$. Define $\eta_i = A_i^{-1} \tilde{e}_i$. 
We let $\omega_i(j_\ell) = \eta_i(\ell)$, $1 \le \ell \le k$, and let 
$\omega_i(j) = 0$ if $j \notin S_i$. 
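
The thresholding and refitting steps described above can be sketched in plain Python as follows. This is an illustrative sketch only, suitable for small matrices: it uses a hand-written Gauss-Jordan inverse instead of a numerical library, takes the empirical covariance as given, and the names `hard_threshold`, `inverse` and `blt_refit` are hypothetical. A column whose own index is removed by the second thresholding is left at zero, which is a choice made here for the sketch.

```python
def hard_threshold(mat, cut):
    """Keep entries with magnitude at least cut; set the rest to zero."""
    return [[v if abs(v) >= cut else 0.0 for v in row] for row in mat]


def inverse(mat):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    k = len(mat)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(mat)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def blt_refit(sigma_hat, eta, zeta):
    """Thresholded covariance -> inverse -> thresholded precision -> refit.

    Returns a p x p nested list whose i-th column is the refitted omega_i.
    """
    p = len(sigma_hat)
    sigma_star = hard_threshold(sigma_hat, eta)        # step (4.1)
    omega_two_star = inverse(sigma_star)
    omega_star = hard_threshold(omega_two_star, zeta)  # step (4.2)
    omega_hat = [[0.0] * p for _ in range(p)]
    for i in range(p):
        support = [j for j in range(p) if omega_star[i][j] != 0.0]
        if i not in support:
            continue                                   # restricted e_i is zero
        a_i = [[sigma_hat[r][c] for c in support] for r in support]
        a_inv = inverse(a_i)
        pos = support.index(i)
        for ell, j in enumerate(support):
            omega_hat[j][i] = a_inv[ell][pos]          # A_i^{-1} e_i, entry ell
    return omega_hat
```

The refit inverts only the small sub-matrix indexed by each support set, so the cost per column depends on the support size rather than on p.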




Fig 1. Comparison of classification errors by HCT (solid), oHCT (dashed) and pHCT 
(dash-dotted). The x-axis is $a$, and the y-axis is the classification error (Experiment 1a). 



4.2. Numerical experiments. Fix $(p, n, \epsilon_p, H_p, \Omega)$ and an 
integer $m$; each simulation experiment contains the following main steps. 

1. Generate a $p \times 1$ vector $\mu$ according to 
$(\sqrt{n} \mu(j)) \stackrel{iid}{\sim} (1 - \epsilon_p) \nu_0 + \epsilon_p H_p$. 

2. Generate training data $(X_i, Y_i)$, $1 \le i \le n$, by letting 
$Y_i = 1$ for $i \le n/2$ and $Y_i = -1$ for $i > n/2$, and 
$X_i \sim N(Y_i \cdot \mu, \Omega^{-1})$. 

3. Generate $m$ test vectors, each of which has the form of 
$X \sim N(Y \cdot \mu, \Omega^{-1})$, where $Y = \pm 1$ with equal 
probabilities. 

4. Use the training data to build all four classifiers, apply them to the 
test set, and then record the test errors. 

When we need to estimate $\Omega$, we use BLT with the aforementioned 
refitting stage. The study contains three different experiments, which we now 
discuss separately. 
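
Steps 1-3 of this recipe can be sketched in Python as below for the special case of a tridiagonal precision matrix with unit diagonal and a common off-diagonal value. The sketch is an assumption-laden illustration, not the code used in the study: it takes the signal distribution to be a point mass, draws noise with covariance equal to the inverse precision matrix through a Cholesky factor of the precision matrix, omits step 4 (building the classifiers), and uses hypothetical function names.

```python
import math
import random


def tridiagonal_cholesky(p, a):
    """Factor Omega = L L' for Omega(i,j) = 1{i=j} + a * 1{|i-j|=1}.

    Returns the diagonal d and sub-diagonal s of L, with s[i] = L[i][i-1].
    Requires |a| < 1/2 so that Omega is positive definite for every p.
    """
    d, s = [0.0] * p, [0.0] * p
    d[0] = 1.0
    for i in range(1, p):
        s[i] = a / d[i - 1]
        d[i] = math.sqrt(1.0 - s[i] ** 2)
    return d, s


def sample_noise(d, s, rng):
    """Draw e ~ N(0, Omega^{-1}) by solving L' e = g with g ~ N(0, I)."""
    p = len(d)
    g = [rng.gauss(0.0, 1.0) for _ in range(p)]
    e = [0.0] * p
    e[p - 1] = g[p - 1] / d[p - 1]
    for i in range(p - 2, -1, -1):
        e[i] = (g[i] - s[i + 1] * e[i + 1]) / d[i]
    return e


def simulate(p, n, eps, tau, a, m, seed=0):
    """Generate the mean vector, n training pairs and m test pairs."""
    rng = random.Random(seed)
    # Step 1: sqrt(n) * mu(j) is tau with probability eps and 0 otherwise.
    mu = [tau / math.sqrt(n) if rng.random() < eps else 0.0 for _ in range(p)]
    d, s = tridiagonal_cholesky(p, a)

    def draw(y):
        return [y * mj + ej for mj, ej in zip(mu, sample_noise(d, s, rng))]

    # Step 2: the first half of the training labels are +1, the rest -1.
    labels = [1] * (n // 2) + [-1] * (n - n // 2)
    train = [(draw(y), y) for y in labels]
    # Step 3: test labels are +1 or -1 with equal probability.
    test = []
    for _ in range(m):
        y = 1 if rng.random() < 0.5 else -1
        test.append((draw(y), y))
    return mu, train, test
```

Solving against the transposed Cholesky factor avoids forming the inverse precision matrix explicitly, which keeps each draw linear in p for the tridiagonal case.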

Experiment 1. In this experiment, we compare HCT with oHCT and 
pHCT. The experiment contains three sub-experiments 1a, 1b and 1c. 

In Experiment 1a, we fix 
$(p, n, \epsilon_p, \tau_p, m) = (3000, 2000, 0.1, 4, 500)$, and let $H_p$ be 
the point mass at $\tau_p$. Also, we choose $\Omega$ to be the tridiagonal 
matrix 

(4.3)  $\Omega(i, j) = 1\{i = j\} + a \cdot 1\{|i - j| = 1\}$,  $1 \le i, j \le p$, 

where $a$ takes values from {.05, .15, .2, .35, .4, .45}. The results are 
reported in Figure 1. The tuning parameter $\eta$ in (4.1), which varies with 
the values of $a$, $n$ and $p$, is calculated from trials of comparing 
$(\hat{\Sigma}^*)^{-1}$ with the true $\Omega$. The tuning parameter $\zeta$ 
in (4.2), which also varies with the values of $a$, $n$ and $p$, is chosen so 
that there are only $k$ nonzero coordinates in each row of $\hat{\Omega}^*$ 
after thresholding of $\hat{\Omega}^{**}$. We let $k = 2, 3$ if $\Omega$ is 
tridiagonal and $k = 4, 5$ if $\Omega$ is five-diagonal (see experiments 
below). In this experiment, $\eta$ is set accordingly from 
{.1, .1, .15, .15, .2, .25} and $\zeta$ is from {.05, .1, .1, .2, .25, .3}. 
The results suggest that HCT outperforms oHCT, but is slightly inferior to 
pHCT since we have to pay a price for estimating $\Omega$. As $a$ increases, 
the correlation structure becomes increasingly influential, so the advantage 
of HCT over oHCT becomes increasingly prominent (but differences between 
HCT and pHCT remain almost the same). 

In Experiment 1b, for various $(p, n, \epsilon_p, \tau_p)$, we choose 
$m = 500$ and let $\Omega$ be either a tridiagonal matrix or a five-diagonal 
matrix. In the first case, $\Omega$ is a $p \times p$ tridiagonal matrix with 
1 on the diagonal and $a$ on the off-diagonal. In the second case, $\Omega$ 
is a $p \times p$ five-diagonal matrix with 1 on the diagonal, $a_1$ on the 
first off-diagonal, and $a_2$ on the second off-diagonal. Experiment 1c uses 
a very similar setting, except that we take $H_p$ as the uniform distribution 
over $[\tau_p - 0.5, \tau_p + 0.5]$. We select $\zeta$ and $\eta$ similarly 
as in Experiment 1a. The results based on 5 repetitions for Experiments 1b-1c 
are reported in Table 1, which suggest that HCT outperforms oHCT and 
that pHCT slightly outperforms HCT. 

Experiment 2. In this experiment, we compare pHCT with CVT assuming 
$\Omega$ is known (the case where $\Omega$ is unknown is discussed in 
Experiment 3). Experiment 2 contains two sub-experiments, 2a and 2b. 
Table 1 

Row 1 (Experiment 1b) 
  Column 1: n = 1000, p = 2000; a = .05, ε_p = .1, τ_p = 4 
  Column 2: n = 2000, p = 3000; a = .45, ε_p = .2, τ_p = 3 
  Column 3: n = 2000, p = 3000; a_1 = .45, a_2 = .2, ε_p = .1, τ_p = 4 

          Column 1    Column 2    Column 3 
  oHCT    0.054       0.2616      0.17 
  pHCT    0.0448      0.058       0.098 
  HCT     0.052       0.061       0.0992 

Row 2 (Experiment 1b) 
  Column 1: n = 500, p = 1000; LL — .UtJ, fcp — U.-L, Ip — 4 
  Column 2: n = 2000, p = 3000; LL — .'±0^ tip — .UtJ, Ip — O 
  Column 3: n = 2000, p = 3000; ai = .00, U2 — tp — .1, /p — 4 

          Column 1    Column 2    Column 3 
  oHCT    0.0536      0.2268      0.1332 
  pHCT    0.046       0.1284      0.0912 
  HCT     0.0524      0.1344      0.1252 

Row 3 (Experiment 1c) 
  Column 1: n = 1000, p = 2000; H_p = U(3.5, 4.5); a = .05, ε_p = .1 
  Column 2: n = 2000, p = 3000; H_p = U(2.5, 3.5); a = .45, ε_p = .2, τ_p = 3 
  Column 3: n = 2000, p = 3000; H_p = U(3.5, 4.5); a_1 = .45, a_2 = .2, ε_p = .1, τ_p = 4 

          Column 1    Column 2    Column 3 
  oHCT    0.052       0.2816      0.1472 
  pHCT    0.046       0.0704      0.0840 
  HCT     0.044       0.0716      0.0891 

Classification errors by HCT, oHCT and pHCT. Ω is tridiagonal (left two columns) or 
five-diagonal matrix (right column). Rows 1-2: Experiment 1b. Row 3: Experiment 1c. 



In Experiment 2a, we consider 6 different combinations of 
$(p, n, \epsilon_p, \tau_p)$ with $m = 500$, and let $\Omega$ be the 
tridiagonal matrix as in (4.3) with $a = 0.2$. Averages of the selected 
thresholds and classification errors across different replications are 
reported in Table 2. The results suggest that the threshold choices by HC and 
cross validation are considerably different, with the former being more 
accurate and more stable. Note that HCT is also computationally much more 
efficient than CVT. 





                     n = 100               n = 50                n = 20 
                     Threshold   Error     Threshold   Error     Threshold   Error 
ε_p = 0.1 
  pHCT               1.9         0.05      2.16        0.002     1.99 
  CVT                2.5         0.08      1           0.018     1 
ε_p = 0.05 
  pHCT               2.39        0.18      2.06        0.10      2.13        0.02 
  CVT                1.9         0.224     2.00        0.14      1.1         0.09 

Table 2 

Comparison of thresholds (Columns 2, 4, 6) and classification errors (Columns 3, 5, 7) by 
pHCT and CVT. (p, τ_p) = (3000, 1.8), and ε_p = 0.1 (top) and 0.05 (bottom). Left to 
right: n = 100, 50, 20 (Experiment 2a). 

In Experiment 2b, we set $(p, \epsilon_p, m) = (3000, 0.05, 500)$, $n \in \{20, 40\}$, and
let $\Omega$ be the same as in Experiment 2a. We let $\tau_p$ range from 1 to 2.5 with an
increment of 0.1. The classification errors by pHCT and CVT are in Figure
2, where a conclusion similar to that in Experiment 2a can be drawn.

Fig 2. Classification errors of pHCT (solid) and CVT (dashed) for $n = 20$ (left) and 40
(right) and various $\tau_p$ (x-axis) (Experiment 2b).

Experiment 3. We compare the performance of HCT with CVT for the
case where $\Omega$ is unknown and needs to be estimated. Note that for small $n$
(say, less than 500) we might not have reasonable accuracy in estimating
$\Omega$ using BLT. For small $p$, say 100-300, the CVT is computationally very
slow and it is very likely that the refitting procedure for BLT would not
have decent performance. We take $(p, n, \epsilon_p) = (5000, 500, 0.1)$ and let $\Omega$ be
the block diagonal matrix consisting of 10 diagonal blocks, each of which is a big
five-diagonal matrix $C = C_{500,500}(a_1, a_2)$, where
$C(i,j) = 1\{i = j\} + a_1 \cdot 1\{|i - j| = 1\} + a_2 \cdot 1\{|i - j| = 2\}$, $1 \le i, j \le 500$,
and $a_1 = .45$, $a_2 = .1$. We let $\tau_p$
range from 1 to 3 with an increment of 0.2. The tuning parameters (including $\eta$) are
set in a similar way as in Experiment 1. The results are reported in Figure
3. Due to high computational cost, we only conduct $m = 6$ repetitions, so
the results are a bit noisy. Still, it is seen that HCT outperforms CVT.
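The block-diagonal $\Omega$ of Experiment 3 can be written down directly from the formula for $C(i,j)$. The following is a minimal sketch, assuming NumPy and a dense representation; the function names `five_diagonal` and `block_diagonal_omega` are hypothetical and not taken from the paper, and the small sizes in the demo are chosen only to keep it cheap.

```python
import numpy as np


def five_diagonal(size: int, a1: float, a2: float) -> np.ndarray:
    """C(i, j) = 1{i = j} + a1 * 1{|i - j| = 1} + a2 * 1{|i - j| = 2}."""
    idx = np.arange(size)
    gap = np.abs(idx[:, None] - idx[None, :])  # |i - j| for every pair
    return (gap == 0) * 1.0 + (gap == 1) * a1 + (gap == 2) * a2


def block_diagonal_omega(num_blocks: int, block_size: int,
                         a1: float, a2: float) -> np.ndarray:
    """Block-diagonal matrix with num_blocks identical five-diagonal blocks."""
    block = five_diagonal(block_size, a1, a2)
    # kron(I, block) places one copy of `block` in each diagonal position.
    return np.kron(np.eye(num_blocks), block)


# Small demo with the a1, a2 of Experiment 3 (sizes reduced for illustration).
omega_small = block_diagonal_omega(num_blocks=3, block_size=6, a1=0.45, a2=0.1)
```

Calling `block_diagonal_omega(10, 500, 0.45, 0.1)` would give the $5000 \times 5000$ matrix described above; a dense array of that size takes roughly 200 MB, so a sparse format may be preferable in practice.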

In summary, for a reasonably large sample size $n$, HCT outperforms oHCT
and is only slightly inferior to pHCT. The reason we need a relatively large
$n$ is mainly that we need to estimate $\Omega$. The relative performance of
pHCT, HCT, and oHCT is intuitive, since pHCT utilizes the true correlation
structure among the features, HCT estimates the correlation structure, while
oHCT ignores it. The comparisons of pHCT with CVT in Experiments 2a-2b
suggest that if $\Omega$ is known, then HCT dominates CVT. Experiment 3 shows
that when $p$ is several times larger than $n$ (e.g., 10 times larger), HCT has
smaller classification errors than CVT does, and the precision matrix $\Omega$ can
be estimated reasonably well.

For larger $p$, the advantages of the HCT are even more prominent than
those considered here. We skip the comparisons for larger $p$ due to high
computational cost, which mainly comes from the BLT procedure (we must
run the algorithm many times to select a good tuning parameter $\eta$). In the
future, if we could find a more efficient method for estimating $\Omega$, then HCT
will be both more effective and more convenient to use for large $p$.

Fig 3. Classification errors by HCT (solid) and CVT (dashed) for various $\tau_p$ (x-axis)
(Experiment 3).

5. Proofs. In this section, we prove all key theorems and lemmas in the
order they appear (except for Theorems 1.2-1.3, which are proved in Section
3.3). Secondary lemmas are proved in Section 6.

5.1. Proof of Theorem 1.1. For short, write $n = n_p$. Recall that the training
samples are $X_i \sim N(Y_i \mu, \Omega^{-1})$, $1 \le i \le n$, where $Y_i \in \{-1, 1\}$ are given.
Consider an (independent) test sample $X \sim N(Y \cdot \mu, \Omega^{-1})$, where $Y = \pm 1$
with equal probabilities. Let $f_{\pm 1}$ be the joint density of $(X_1, \ldots, X_n, X)$
in the case where $Y = 1$ and $Y = -1$, respectively, and let $H(f, g)$ be
the Hellinger distance between two density functions $f$ and $g$. To show the
claim, it is sufficient to show $H(f_1, f_{-1}) \to 0$ as $p \to \infty$, uniformly for all
$\Omega \in \mathcal{M}_p^*(a, K_p)$. Let $f_0$ be the joint density of $(X_1, \ldots, X_n, X)$ in the case
where $X \sim N(0, \Omega^{-1})$ (but the distributions of $X_i$ remain the same). By the
triangle inequality and symmetry, $H(f_1, f_{-1}) \le H(f_1, f_0) + H(f_{-1}, f_0) = 2H(f_1, f_0)$.
Therefore, it is sufficient to show

(5.1) $H(f_1, f_0) \to 0$.

Since $\Omega$ is a $K_p$-sparse correlation matrix, by Lemma 1.1, there is a
permutation matrix $P$ and an integer $M_p = M_p(\Omega, K_p)$ such that $M_p \le C K_p \log(p)$
and

(5.2) $P \Omega P' = \begin{pmatrix} \Omega_{11} & \cdots & \Omega_{1 M_p} \\ \vdots & \ddots & \vdots \\ \Omega_{M_p 1} & \cdots & \Omega_{M_p M_p} \end{pmatrix}$,

where on the diagonal, $\Omega_{11}, \ldots, \Omega_{M_p M_p}$ are identity matrices. Since permuting
the coordinates of $X_1, X_2, \ldots, X_n, X$ simultaneously does not change the
Hellinger distance $H(f_1, f_0)$, we assume $P = I_p$ for simplicity.

Now, corresponding to the partition of $\Omega$ in (5.2), we partition the mean
vector $\mu$ as $\mu = ((\mu^{(1)})', \ldots, (\mu^{(M_p)})')'$. For $0 \le m \le M_p$, let $P_m$ be the
projection matrix such that $P_m \mu = ((\mu^{(1)})', \ldots, (\mu^{(m)})', 0, \ldots, 0)'$, where
generically, $0$ denotes a row vector of zeros, and let $f^{(m)}$ be the joint density
of $(X_1, \ldots, X_n, X)$ under the law that $X_i \sim N(Y_i \mu, \Omega^{-1})$ for all $1 \le i \le n$
and $X \sim N(P_m \mu, \Omega^{-1})$. Note that $f_0 = f^{(0)}$ and $f_1 = f^{(M_p)}$, and that by the
triangle inequality,

(5.3) $H(f^{(0)}, f^{(M_p)}) \le \sum_{m=1}^{M_p} H(f^{(m-1)}, f^{(m)})$.

Recalling $M_p \le C K_p \log(p)$ and $K_p \le L_p$, (5.1) follows by Lemma 5.1
below. □
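The telescoping step (5.3) can be illustrated numerically in a simplified setting. The sketch below assumes a fixed (non-random) mean vector, so that only the law of the test sample differs between $f^{(m-1)}$ and $f^{(m)}$, and it assumes the convention $H^2 = 1 - \text{affinity}$, under which two Gaussians with common precision matrix $\Omega$ and mean difference $\Delta$ have affinity $\exp(-\Delta' \Omega \Delta / 8)$. The paper works with mixtures over the prior on the mean, so this is an illustration of the triangle-inequality chain only, and the names `hellinger_gauss` and `telescoping_check` are hypothetical.

```python
import numpy as np


def hellinger_gauss(m1, m2, omega):
    """Hellinger distance between N(m1, omega^{-1}) and N(m2, omega^{-1}),
    with the convention H^2 = 1 - affinity (an assumption of this sketch)."""
    delta = m1 - m2
    affinity = np.exp(-0.125 * delta @ omega @ delta)
    return np.sqrt(1.0 - affinity)


def telescoping_check(mu, omega, block_sizes):
    """Return (H(f^(0), f^(M)), sum_m H(f^(m-1), f^(m))) for fixed mu."""
    ends = np.cumsum(block_sizes)
    projected = [np.zeros_like(mu)]          # P_0 mu = 0
    for end in ends:
        v = np.zeros_like(mu)
        v[:end] = mu[:end]                   # keep the first m blocks of mu
        projected.append(v)
    lhs = hellinger_gauss(projected[0], projected[-1], omega)
    rhs = sum(hellinger_gauss(projected[m - 1], projected[m], omega)
              for m in range(1, len(projected)))
    return lhs, rhs


# Demo: tridiagonal precision matrix, sparse mean, three blocks of size 4.
p_demo = 12
omega_demo = np.eye(p_demo) + 0.2 * (np.eye(p_demo, k=1) + np.eye(p_demo, k=-1))
mu_demo = np.zeros(p_demo)
mu_demo[[1, 5, 9]] = 1.5
lhs_demo, rhs_demo = telescoping_check(mu_demo, omega_demo, [4, 4, 4])
```

In this toy setting `lhs_demo` never exceeds `rhs_demo`, which is the content of (5.3); the real work in the proof is the bound on each summand provided by Lemma 5.1.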

Lemma 5.1. There is a constant $c_0 = c_0(\beta, r, \theta) > 0$ such that for any
$1 \le m \le M_p - 1$,

(5.4) $H(f^{(m-1)}, f^{(m)}) \le L_p p^{-c_0}$.

5.2. Proof of Lemma 5.1. Denote $K = K_p$, $M = M_p$, and $n = n_p$ for
short. Recall that each of $X, X_1, \ldots, X_n$ can be partitioned into $M$ blocks.
We simultaneously swap the first block and the $m$-th block of $X$ and of
each $X_i$, but still denote the resultant vectors by $X$ and $X_i$ for notational
simplicity. Denote D = D = ((/i^^^)', . . . , (/i^™"^))', 0, . . . , 0)', and fi = 

((^(1))', (^(2))', . . . , (//('"-^))', (m('"+^))', . . . , (Ai^^))')'- After the swaps, /(™) 
is the joint density of [Xi, . . . X), where the common mean vector of 
Xi,. . . ,Xn (which we still denote by fi for simplicity) is ^ = {v' , the 
mean vector of X is (z>', v')' , and the common precision matrix (still denote 
by Vt for simplicity) of Xi, . . . , Xn, X is 



(5.5) $\Omega = \begin{pmatrix} I_k & B \\ B' & D \end{pmatrix}$,

where $I_k$ is a $k \times k$ identity matrix with $k = k(\Omega, m)$ equal to the size
of the $m$-th block (before the swaps) and $D$ is a correlation matrix. Similarly,
$f^{(m-1)}$ is the joint density of $(X_1, \ldots, X_n, X)$, where the laws of
$X_1, \ldots, X_n, X$ are the same as those under $f^{(m)}$ except that the mean vector
of $X$ is $(0, u')'$ instead.

Denote for short /q = fi^-'^)^ = Since Yi are given, we assume 

Yi = 1 for notational simplicity. Consequently, Z = ^ Yll=i ^i^i reduces to 
^ ~ Sr=i -^i - definitions and elementary statistics, fo{xi, . . . ,Xn,x) = 
(j){x,il.)Il^^^(j){xi,il.) ■ I, and . . . , x^, x) = (j){x,Q)U'!j'^^(j){xi,Q) ■ II, 

where 

and F{ii) denotes the cdf of /i. Here, x and xi are p x 1 vectors, z = 
X^iLi ™d (l){x, 0) is the joint density of A^(0, 0^^). For I < i < k, de- 
note the i-th row of B in (5.5) by w^. Also, write Qx = {x, x)' and Qz = {z, z)' 
so that the lengths of x and z are k. Introduce g = g{z, jl), h = h{z, x, /i, P), 
and w = w{z, //, D) by 

g = nti [(1 - €p) + epe^f^"»-5^'-v^^''(^-^)] , 

hg = n^_^ [(1 — tp) + epe'^''^'"'"*-^''/^-'*'"^'^^"^'^^"^'^''^'^*'^-*^'''^'"'^^^'^''^^] , 
and 

Here, we have suppressed the expressions of g, h, and w as long as there is 
no confusion. Since D and /i are independent, by direct calculations, 

which, by the definitions, implies that I = J gwdF{ll). Similarly, II = 
J hgwdF{jl). 

Let $A(f_0, f_1)$ and $H(f_0, f_1)$ be the Hellinger affinity and the Hellinger
distance between $f_0$ and $f_1$, respectively. It is well-known that there is a
universal constant $C > 0$ such that

(5.6) $|1 - A(f_0, f_1)| \le C \cdot H(f_0, f_1)$.

Let $E_0$ be the expectation under the law that $X_1, \ldots, X_n, X$ are iid from
$N(0, \Omega^{-1})$. By Hölder's inequality,
$H(f_0, f_1) \le E_0[(\int (h - 1) g w \, dF(\tilde{\mu}))^2 / (\int g w \, dF(\tilde{\mu}))] \le E_0[\int (h - 1)^2 g w \, dF(\tilde{\mu})]$.
Since $E_0[\int h g w \, dF(\tilde{\mu})] = 1$ and $E_0[\int g w \, dF(\tilde{\mu})] = 1$,
it is seen that

(5.7) $H(f_0, f_1) \le E_0[\int h^2 g w \, dF(\tilde{\mu})] - 1$.

Note that h^g does not depend on x and z, and that {x\x) is independent 
of and ~ N{B'x,D- B'B), (l|z) ~ N{B'z,D- B'B). It follows 
that E[w\{x,z)\ = exp{y/^il'B'z - ^fi'B'Bfi + D'B'i - \D'B'BD). Denote 
the right hand side hy v = v{x, z, fl,D). It follows that -Eo[/ h?'gwdF{jl)] = 
Eq[J h'^gvdFifi)]. Combining this with (5.6)-(5.7) gives 

(5.8) $|1 - A(f_0, f_1)| \le C(E_0[\int h^2 g v \, dF(\tilde{\mu})] - 1) \equiv C(IV - 1)$.

We now evaluate $IV$. For simplicity, we assume $H_p$ is a point mass at $\tau_p$;
the proof for general cases is similar since the support of $H_p$ is contained in
$[-\tau_p, \tau_p]$, but we need an extra layer of integration, so the expression is

much more cumbersome. Denote for short = (1 — Cp) and bi = 1 — Cp + 

^2 

epexp(rpZj — |- — ^/nTp{uji,fjL)), 1 <i < k. By direct calculations, 
(5.9) 



IV = En 



^2 

J V ai + bi J 



Recall that x and z are independent normal vector with as the covariance 

matrix. It follows 

(5.10) 

Eo [{a, + 6,e^^'-^"-'-^(""'))2e('^"^)^'-i(--^)'] = (a^ + + (e^ - 1)6^. 
Denote for short ^/n(iOi,jl) = diTp. By definitions and direct calculations, 
(5.11) ^^^gV^(c.„A)5.-f + 5.)] = 1^ 



and 
(5.12) 



,{2+d,)TpZ,-{2+di)^r^/2 
^2 

- e„) + e„e""p^''"^"^'"'p 
Inserting (5.10)-(5.12) into (5.9) gives



IV 



(5.13) 



nil e 



^y/n{u]i,jl)zi-^(wi,fi)'^ 



[a, + h + (e"'/" - 1) 



Oj + bi 



i-rfc 



ri , {2+d,)rj.z,-{2+d,fr^/2 

1 + (e^ - l)ele--E[- 



1 - ep + epC^f^'-^ 



-1 't> 



(iF(^). 



Write ^e2e^pE[e(2^f+*)^>-(2^f+*)'/V[(l-ep) + epe^f^''-^-'^»^p]] = Ai + Bi, 



where 



n ^ 



Si 



n 



ep <l(tp - Tp) 



and tp = [(r + /3)/(2r)]Tp. First, by Mills' ratio [41], Ai < Lpp-2/5+2''-^ 
Second, for Bi, noting that tp/Tp > 1 in the range of interest, so Bi < 
Lpp~^^~^^^ /(4r)-6»_ gy Q^j. assumptions, there is a constant cq = co{(3,r,9) > 
such that min{2/3 — 2r + 0, ^^^J'^ — \- 6} > 1 + cq. Combining these gives 

-2 \ r {2Tp+di)zi-{2Tp+dif/2 



(5.14) 



n ^ 



1 - Cp + epc'^p^' ^ '''^f 



< Lpp 



-(l+co) 



Inserting (5.14) into (5.13), IV < 1+p '^'■K Inserting this into (5.8) gives the 
claim. □ 
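The proof above invokes Mills' ratio to turn Gaussian tail probabilities into powers of $p$. As a quick numerical companion, the sketch below checks the standard two-sided Mills bounds $\frac{t}{1+t^2}\phi(t) \le \bar{\Phi}(t) \le \phi(t)/t$ for $t > 0$ and evaluates the tail at $\tau_p = \sqrt{2 r \log p}$. That calibration of $\tau_p$ is an assumption here (it is the usual one in this literature but is not restated in this section), and all function names are hypothetical.

```python
import math


def normal_pdf(t: float) -> float:
    """Standard normal density phi(t)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def normal_sf(t: float) -> float:
    """Standard normal survival function P(N(0,1) > t)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def mills_bounds(t: float):
    """Return (lower bound, exact tail, upper bound) for t > 0."""
    lower = t / (1.0 + t * t) * normal_pdf(t)
    upper = normal_pdf(t) / t
    return lower, normal_sf(t), upper


def tail_at_tau(p: int, r: float):
    """Tail at tau_p = sqrt(2 r log p) (assumed calibration) next to p^{-r}."""
    tau = math.sqrt(2.0 * r * math.log(p))
    return normal_sf(tau), p ** (-r)
```

Since $\phi(\tau_p) = p^{-r}/\sqrt{2\pi}$ under this calibration, the upper Mills bound gives $\bar{\Phi}(\tau_p) \le p^{-r} / (\tau_p \sqrt{2\pi})$, which is the kind of $L_p p^{-r}$ estimate used repeatedly in these proofs.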



5.3. Proof of Lemma 1.1. We define $R_0, R_1, \ldots, R_M$ recursively as
follows: (a) Let $R_0 = \emptyset$. (b) Given $R_0, \ldots, R_{m-1}$, let
$R_m \subset \{1, 2, \ldots, p\} \setminus (R_0 \cup \ldots \cup R_{m-1})$ be the subset whose size is as large as possible and which satisfies
$\Omega(k, i) = 0$ for any two different indices $k \in R_m$ and $i \in R_m$ (if there
is more than one such subset, pick any one). The process is repeated until
no index is left. Clearly, the constructed $R_1, R_2, \ldots, R_M$ satisfy the second
claim, and all that remains is to show that $M \le C K_p \log(p)$.
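The recursive construction above can be sketched in code, with one deliberate substitution: finding a subset of maximum size is a maximum-independent-set problem, so the sketch below uses a greedy pass that returns a maximal subset (one to which no further index can be added) rather than a largest one. This is not the construction in the proof, only an inexpensive stand-in; the name `greedy_partition` and the parameter `tol` are hypothetical.

```python
import numpy as np


def greedy_partition(omega: np.ndarray, tol: float = 0.0):
    """Split {0, ..., p-1} into blocks R_1, R_2, ... such that
    omega[k, i] == 0 for any two distinct indices k, i in the same block.

    Each block is built greedily (maximal, not necessarily of maximum size).
    """
    p = omega.shape[0]
    adjacency = np.abs(omega) > tol      # nonzero pattern of omega
    np.fill_diagonal(adjacency, False)   # ignore the diagonal
    remaining = list(range(p))
    blocks = []
    while remaining:
        block = []
        blocked = np.zeros(p, dtype=bool)
        for i in remaining:
            if not blocked[i]:
                block.append(i)
                blocked |= adjacency[i]  # neighbours of i can no longer join
        chosen = set(block)
        remaining = [i for i in remaining if i not in chosen]
        blocks.append(block)
    return blocks


# Demo: a tridiagonal matrix splits into even and odd indices.
p_demo = 12
omega_demo = np.eye(p_demo) + 0.4 * (np.eye(p_demo, k=1) + np.eye(p_demo, k=-1))
blocks_demo = greedy_partition(omega_demo)
```

After permuting the coordinates block by block, each diagonal block of the permuted matrix is an identity matrix, which is the structure displayed in (5.2).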

For $m = 0, 1, \ldots, M$, let $s_m = |R_m|$. The key to the proof is that for all
$0 \le m < M$,

(5.15) $s_{m+1} \ge \max\{1, \frac{1}{K_p + 1}(p - s_0 - s_1 - \ldots - s_m)\}$.

Since the proofs are similar, we only show the case $m = 0$. Let $R_1 = \{i_1, i_2, \ldots, i_{s_1}\}$,
and $D_k = \{1 \le i \le p : i \ne i_k, \Omega(i_k, i) \ne 0\}$, $1 \le k \le s_1$.
By the assumption of $K_p$-sparsity, $|D_k| \le K_p$, and so
$|R_1 \cup (\cup_{1 \le k \le s_1} D_k)| \le |R_1| + \sum_{1 \le k \le s_1} |D_k| \le s_1 (K_p + 1)$.
If (5.15) does not hold, then $s_1 (K_p + 1) < p$,
and there is an index $j^* \notin R_1 \cup (\cup_{1 \le k \le s_1} D_k)$. Let $R_1' = \{j^*\} \cup R_1$. It is seen
that $\Omega(j, k) = 0$ for any $j, k \in R_1'$ and $j \ne k$, which contradicts the
definition of $R_1$. This shows that $s_1 (K_p + 1) \ge p$ and (5.15) follows.

Next, let $m_0 \le M$ be the largest index such that $s_1 + \ldots + s_{m_0} \le p - K_p - 1$.
We claim that for all $1 \le m \le m_0$,

(5.16) $(s_1 + s_2 + \ldots + s_m) \ge p \sum_{j=1}^{m} \frac{K_p^{j-1}}{(K_p + 1)^j} = p \left[1 - \left(\frac{K_p}{K_p + 1}\right)^m\right]$.

It suffices to show the first inequality. We show this by mathematical
induction. First, by (5.15), this is true for $m = 1$. Second, if this holds for
$m - 1$, then $(s_1 + s_2 + \ldots + s_{m-1}) \ge p \sum_{j=1}^{m-1} \frac{K_p^{j-1}}{(K_p + 1)^j}$. At the same time, by
(5.15),
$s_1 + \ldots + s_m \ge (s_1 + \ldots + s_{m-1}) + \frac{1}{K_p + 1}(p - [s_1 + \ldots + s_{m-1}]) = \frac{p}{K_p + 1} + \frac{K_p}{K_p + 1}(s_1 + \ldots + s_{m-1})$.
Combining these with basic algebra, the
inequality holds for $m$ and the claim follows.

By (5.16), $1 - (K_p/(K_p + 1))^{m_0} \le (p - K_p - 1)/p$. Therefore,
$m_0 \le \log(\frac{p}{K_p + 1}) / \log(1 + \frac{1}{K_p}) \le (K_p + 1) \log(p)$. Also, by the way the sets are
constructed, $M \le m_0 + K_p + 1$. Combining these gives the claim. □
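The arithmetic behind (5.16) and the resulting bound on $m_0$ is easy to check numerically. The sketch below is only a sanity check under the formulas as reconstructed above: it compares the geometric sum with its closed form and evaluates $\log(p/(K_p+1)) / \log(1 + 1/K_p)$ against $(K_p + 1)\log p$. The function names are hypothetical.

```python
import math


def geometric_lower_bound(p: int, k: int, m: int) -> float:
    """p * sum_{j=1}^{m} k^(j-1) / (k+1)^j, the middle expression in (5.16)."""
    return p * sum(k ** (j - 1) / (k + 1) ** j for j in range(1, m + 1))


def closed_form(p: int, k: int, m: int) -> float:
    """p * [1 - (k / (k+1))^m], the right-hand side of (5.16)."""
    return p * (1.0 - (k / (k + 1)) ** m)


def rounds_bound(p: int, k: int) -> float:
    """Upper bound on m_0 implied by (5.16): log(p/(k+1)) / log(1 + 1/k)."""
    return math.log(p / (k + 1)) / math.log(1.0 + 1.0 / k)
```

For example, with `p = 3000` and `k = 2` the bound `rounds_bound(3000, 2)` is about 17, comfortably below `(k + 1) * math.log(p)`, which is about 24. The second inequality uses $\log(1 + 1/K_p) \ge 1/(K_p + 1)$.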

5.4. Proof of Lemmas 2.1-2.2. Before we prove these two lemmas, we
need some preparations. Recall that $D_j = \{k : 1 \le k \le p, \Omega(j, k) \ne 0\}$
for $1 \le j \le p$. Introduce the events $A_{0j} = \{\mu(k) = 0, \forall k \in D_j\}$,
$A_{1j} = \{\mu(k) \ne 0 \text{ for exactly one } k \in D_j\}$, and
$A_{2j} = \{\mu(k) \ne 0 \text{ for some } k \in D_j, k \ne j\}$. Let $\tilde{\mu} = \Omega \mu$. It is seen that

• Over the event $A_{0j}$, $\tilde{\mu}(j) = 0$.

• Over the event $A_{1j} \cap \{\mu(j) \ne 0\}$, $\sqrt{n} \tilde{\mu}(j) = \sqrt{n} \mu(j) = \tau_p$.

• Over the event $A_{1j} \cap \{\mu(j) = 0\}$, $\sqrt{n} |\tilde{\mu}(j)| \le a \tau_p$.

Let ho{t) = ho{t,ep,Tp,n) = p-^Yl%iPi\ZU)\ > t;Aoj), /i+(t) = hf {t,ep,Tp,Q) = 

p-' Yfj=i nz{j) > t; AijCiMj) / 0}), (t) = {t, Ep, Tp, n) = P{z{j) < 

-t-A,, n / 0}), and g^it) = ^ E?=i ^[/i(i)sgn(^(j)) • l{\Z{j)\ > 

t}\A2j]P{A2j). Further, recah that gi{t) = | Ei=i ^(l^(i)l > *>^2j). By 

definitions, it follows that 

(5.17) $F(t) = h_0(t) + h^+(t) + h^-(t) + g_1(t)$, $\quad m_p(t) = n_p^{-1/2} p \tau_p (h^+(t) - h^-(t) + g_2(t))$.

Lemma 5.2 below summarizes some basic properties of these quantities, the 
proof of which is elementary so we omit it. 
Lemma 5.2. For any t > 0, we have (a) (1 - Kpep)^{t) < ho{t) < ^{t),
(b) (1 - Kpep)ep^t - Tp) < h+{t) < ep^{t - Tp), (1 - Kpep)ep^t + Tp) < 

Kit) < ep^{t + Tp), (c) < gi{t) < Kpep^ar,{t) + (i^pep)^^(l+a)rp W + 

C{Kpep)'^, (d) < g2{t) < Kpgi{t), and (e) (1 - Kpep){^{t) + ep^r^{t)) < 
F{t). 

Next, the following lemma is proved in the appendix. 

Lemma 5.3. Fix $a \in (0, 1)$ and $\tau > 0$. Let $(X, Y)$ be a bivariate normal
distribution with mean vector $(0, \tau)'$, variance one and correlation $\rho$. Then
there is a constant $C = C(a) > 0$ such that for all $\rho \in [-a, a]$,

$P(|X| > t \mid |Y| > t) \le C(1 + t) \exp\left(-\frac{1 - a}{2(1 + a)} t^2\right)$.

By Lemma 5.3, we have the following lemma which is proved in Section 
5.9. 

Lemma 5.4. For any $t > 0$, we can write $v_p(t) = p(F(t) + rem(t))$,
where the remainder term $|rem(t)|/F(t)$ can be bounded from above by



+ Lp{l + t) exp ( - -^(x^) , r < P and t < Tp + Sp, 

r > f3 or t > Tp + Sp, 




where Sp = ^ max{2(/3 — r), (/? + r)} logp. Moreover, when r < f3 and t < 
$\tau_p + s_p$, we have $v_p(t)/(pF(t)) \ge 1 - o(1)$. In addition, if the smallest
eigenvalue of $\Omega$ is bounded from below by $b > 0$, then $v_p(t)/[pF(t)] \ge b$.

Recall that in (2.17) and (2.8), we defined Wo{t) and its proxy ^0(^)5 
respectively. Define a{t) = ^(Wo{t))-^[hf (t) + h]; (t) + gi{t)]ivp{t))-^/^ 
and Slit) = ivpit))-^/^[^ig2it)-giit)-2hYit))]. Then 5^(t, e^, r^, O) = 
2Tpy/p/ np[ait)Woit) + 'S'i(i)]. The following two lemmas are proved in Sec- 
tion 5.7 and 5.8. 

Lemma 5.5. Fix (/3,r) G (0, 1)^ and U G Mpia,Kp). Then 
(5.19) _ 

sup |5i(t)| < Lp(p- 3/3/2 +p-(/3+r)^ ^_^^p-co(/3,r,a) 
{0<t<Tp+Sp} {0<t<oo} 

where co(/3,r, a) is defined in (2.12) and Sp is defined in Lemma 5.4- If in 
addition Q G Ai*ia,b, Kp), then the above inequality holds with the left hand 
side replaced with sup|^->o} l'S'i(i)|- 
Also, ifr <p andt< Tp+Sp, then \a{t)-l\ < Lpp^^^'^{r,^X^-^)iP~'^r)} ^

Lp{l + t) exp ( — '^2{i+l) ) ' ^'^^ addition J7 G A^p(a, 6, -fi^p); then Kp < 
a(t)< 6-1/2. 

Lemma 5.6. Fix {r,P) G (0, 1)^. Then 

3/3/2 I o c,,^ KpCp^arpit) 



(5.20) sup |VFo(t) - VFo(t)| < Lpp^-""'^ + 2 sup - _ 

= + sup Woit), 

{t>0} 

where co{f3,r,a) is defined in (2.12). 

We now prove Lemma 2.1 and Lemma 2.2 separately.

5.5. Proof of Lemma 2.1. Write for short $Sep(t) = Sep(t, \epsilon_p, \tau_p, \Omega)$. We
consider the two cases 1) $t > \tau_p + s_p$ and 2) $t < \tau_p + s_p$ separately, where $s_p$
is as in Lemma 5.4.

First consider case 1). We will show that (la) Sep{t) < Lpp^ maxj/^-^r-,— 

and (lb) Wait) < Lpp-""^^-^^-^^'^^. Then combining (la) and (lb) com- 
pletes the proof of the lemma in case 1). We now proceed to prove (la) and 
(lb). The result (lb) follows immediately from the definition of W(){t) and 



the inequalities Wo{t) < yjep'^rj,{t) < Lpp-™^^i4/3-2r,3/3+r}/4_ remains to 

prove (la). Let rj he a p x 1 vector such that r]{j) = l{{Qflf){j) / 0}, 
1 < J < Also, for any p x 1 vectors x and y, let x o y be the p x 1 vec- 
tor such that {x o y){j) = x{j)y{j), I < j < p. By definition, it is seen that 
mp{t) = E[Mp{t)] = E[{fifynn] = E[{fifyn{fior])]. Using Cauchy-Schwartz 
inequality, mp{t) < {E[{flfynflf])'^^'^ {E{{noriyn{fior])])^^^ . Recalling that 
Vp{t) = E[Vp{t)] = E[{fi^ynfif], it follows that 

(5.21) \S^p{t)\=2mp{t){vp{t))-'/^ <2{E[{fior,yn{fion)])^/\ 

Since the largest eigenvalue of Q is no greater than Kp, the last term above 
< 2Kp^'^{E\\por]\\^)^/^ and so \Sep{t)\ < 2Kp^^{E\\fior]\\'^y/'^. It remains to 
study i^ll^ o By definition, 



p ^2 _ ^2 p 



%=\ 1=1 jeDi 



^2 P 

p 



E ^(/^(^) ^ 0' ^ ^ Lpp'~'{ep^rAt) + ^p^ar,{t) + CKpel). 
Since we consider the range $t > \tau_p + s_p$, the above expectation can be
bounded as E\\fiorif < Lppi-^->^a^{4/3-2r,3/3+r}/2^ Inserting this into (5.21) 
we complete the proof of (la). 

Now we consider the case 2). Recall that Sep{t) = 2Tpy'p/np[a{t)Wo{t) + 
Si{t)]. Noting that Up = p^, the key is to show 



(5.22) 



{2Tpr'p^'~'y^Sep{t) - Wo{t) < Lpp-^l"'^ + Lpp-P- 



sup 

{0<t<rp+Sp} 

^ ^ {t>0} ^ 

In fact, once this is proved, the claim follows by using Lemma 5.6. 
We now show (5.22). By Lemma 5.5, 

(5.23) sup|o<t<.,+,,} \p^'~^'^'\2Tpr^s7p{t) - Wo{t)\ 

< sup|o<j<rp+Sp} \a{t) - l|Wo(t) + sup|o<t<^^+^-p} |5'i(t)|. 

The second term on the right was studied in Lemma 5.5 inequality (5.19). 
We now study the first term on the right. By lemma 5.5, 



(5.24) sup|o<t<^p+g^} \a{t) - l|Wo(t) < sup|t>o} h{t) + sup|i>o} hit), 

1— g 
2(l+a) 



where h{t) = ^^(p- '"i'^i^^' + c{l + t) exp(- 3^^*^)) Wo(i), 



hit) = Lp\Wo{t) -Wo{t)\. 
Consider h{t) first. By Lemma 5.6 and Lemma 5.5, 

(5.25) SUP|,>o} h{t) < Lp{p-^P'^ + p--0(/3,r.,a) sup|o<t<oo} W^{t)) . 

Consider Ii{t) next. Write Ii{t) = ha{t) + hb{t), where ha{t) = Lp- 
We first study luit). By definitions and elementary algebra, 

sup {(l+i)exp(-ij— ^t2)^^(^)|^^^^-c,(/3,r,a) 
{0<t<oo} + "J {0<t<oo} 

where ci(/3, r, a) is defined in (2.12). Combining these results and comparing 
terms yields 



(5.26) supli(t) < Lj,(p-™'^^^'^'(i~")('^""'')}+p^^i(^'''''^n sup Wo{t). 

t>0 ^ ' {0<i<oo} 
Combining (5.26) and (5.25) with (5.24) yields
sup \a{t) - l|Wo(t) < Lpp^'^^/'^ 

{0<i<rp+Sp} 

_|_ /'p- mining, (l-a)(/3-ar-)} _|_p-co(/3,r,a) _|_^-ci(/3,r,a)\ WQ{t). 
^ ' {0<t<oo} 

Inserting this and (5.19) into (5.23) completes the proof of the lemma when 

t < Tp + Sp. □ 

5.6. Proof of Lemma 2.2. First, we consider (a)-(b). By Lemma 5.5, 
{2TpY^ .J^^JpSep{t) < b~^/^Wo{t) + Si{t), where Wo{t) is defined in (2.17), 
and Si{t) is as in Lemma 5.5. The key is to prove that there is a constant 
do > such that for any fixed t satisfying either < t < \^2f3 logp — 
do log logp/ Vlogp or t> Tp + 2y/log{Kplogp), 

(5-27) Wo{t)< 5,(0 < 

In fact, once these are proved, then 

(5.28) S^pit) < 2r,p(i"^)/2[6"V2T^o(t) + S^{t)] < ^r,i^; V^^'-^)/^ 

and parts (a)-(b) of the lemma follow. 

We now show (5.27). Recall that by the proof of Lemmas 5.5-5.6, 

3/3/2 , , CKpEp^aTpit) 



(5.29) \S^{t)\<Lp{p-'^''+p-^-'') + 



l^{t) + Kpep^arp{t) 

(5.30) < Wo(t) - Wo(t) < Lpp-'-'PI'' + 



-3/3/2 , CKptp^aTpit) 



^{t)+Kpep^arp{t) 



note that the last terms in the above two inequalities are the same. We 
now consider the case t < ^/2JT\ogp — do log logp/ y/logp and the case t > 
Tp + 2Y^log(Kp logp) separately. 

In the first case, by Mills's ratio [41], with the constant do > being 
appropriately chosen, ^{t) + KpCp^arpit) > dC^b"^ K^{\ogp)'^€p and ^{t) + 
^P^Tpit) > 9b~^Kp€p. As a result, 

CKpep^arpit) ^/Wp ~ _ ep^Tp{t) ^ ^/bTp 



l^>{t)+Kpep^>arp{t) ^^P^^'^P J^>{t)+ep^rp{t) 
Inserting these into (5.29) and (5.30), the claim follows by noting that ep =
Consider the second case. In this case, ep^arpif) = o{epp~^^~°^ Thus, 

^"'^^"^-^'^ < jKpep^ar^t) = o{Kp\logpr'V^,), 

^{t)+Kpep^ar,{t) 



Wo{t) = ^ : < Jep^^^it) < ^bep/iSK, 

/^^>{t) + ep^r^t) 



and 



Inserting these into (5.29) and (5.30) proves (5.27), the claim follows by 
similar reasons. 

Next, consider (c). Write for short Sp = ^2fi logp — do log log p / \/\og p . 
Since the eigenvalue of Vt is bounded from above by Kp, by definition we have 

Vp{t) < KppF{t). Thus, Sep{t) = 2mp{t) / ^ Vp{t) > 2Kp^^'^mp{t) / \J pF{t). 
By definitions in (5.17) and Lemma 5.2 we can further obtain that 

> 2TpP^(/i+(t)-/i^(t)) ^ 2TpP V [(1 - Kpep)ep^{t - Tp) - ep^t + Tp)] 



KpF{t) \/Kp{^{t) + ep^r,{t) + Kpep^ar,{t) + C{KpepY) 

When Sp < t < Tp, the numerator above ~ 2Tpp~i ^, and the denomina- 
tor above < Kpp~^. Thus, 5ep(t) > 2TpKp^p^^-'^^^'>/'^ . On the other hand, 
recall that sup^->o^o(0 = Lpp~^/'^ when r > f3, which together with Lem- 
mas 5.5-5.6 ensures sup4>o^o(*) < Lpp~^/'^ and sup^^g 'S'i(i) < Lpp^^f^. 
Since {2Tp)~^ ^/np/pSep{t) < b~^/'^Wo{t) + Si{t), combining these entails 
Sep{t) < Lpp(^^^~^)/^. This completes the proof of part (c). □ 

5.7. Proof of Lemma 5.6. Recall that Woit) = ^^^'^^p^+g^^^) ^here 

gi{t) is as in Lemma 5.2. We will compare Wo{t) with Wo{t) defined in (2.8). 
On one hand, since {A + x)/\/S~+~r is an increasing function of x when 
< A < i?, it is seen that Wo(^) ^ ^o(i)- On the other hand, writing for 
short b{t) = KpCp^aTpit) + (KpCp)'^^ (^ij^a)Tp{'t) ^ it follows from Lemma 5.2(c) 
that
(5.31) 



'^^>{t) + ep^r,{t) + bit) + CK|e| 

'^{t) + Kpep^ar,{t) Jep^r,{t) + b{t) 



Combining these and recahing ep = p ^, we have 



-3/3/2 

0<t<OO 

where 



sup \Wo{t) - Wo{t)\ < Lpp-'^f'^ + I + II, 

0<t<oo 



Kpep^ar^it) K^4^(^+^) (t) 

1= sup 11= sup I _ 

0<t<oo ^^(i) + Kpep^ar,{t) 0<*<- ^ep-^r.Ct) + K*) 

To show the first inequahty of claim, it is sufficient to show 
(5.32) 

// < Lpp-^^'^+Lpp-P/^ sup Kpep^ar,{t) ^ ^ Lpp~^P/^+Lpp-P/^-I. 

0<t<co ^^>{t) + Kpep^ar,{t) 

Towards this end, we write // < Ila + lib, where I la and lib are the 



supremum oi Kpe^^ (i^a)Ty{t) / ^ ep'^ rp{t) + b{t) over the intervals < t < 
and Tp < t < oo, respectively. Consider Ila. When < t < Tp, ^Tp{t) > 

1/2, and so Ila < K^elsup^o<t<r,} ^J^ffjf < ^p^p^'- Consider lib. By 
definitions and change-of-variable, and recalling €p = p 



lib < sup — _ I = sup 



< Lpey' . I = Lpp^PI^ . I. 

Combining these proves (5.32). Consequently, the first inequality of the claim 
follows. 

To show the second inequality in the claim, we use similar calculations as 
in [16] and get 

sup {Wo{t)} = Lpp-^'-''^^ I = Lpp-^'-'"''^''^^ = Lpp-'^^^''''''^ sup Wo{t), 

{0<i<oo} 0<t<oo 

where we have used co(r, /3,a) = 6{r,(3) — 6{a'^r,/3) as in (2.10). □ 
5.8. Proof of Lemma 5.5. Consider the first claim. By Lemma 5.2 (part
(d)), $|g_2(t)| \le K_p g_1(t)$. So by definitions,

(5.33) \Sm < [K, + 1)^^ + ^ {K, + l)Bo{t) + B,it). 



Consider Bo{t) first. Rewrite Boit) = [gi{t) / F{t)]yJ pF{t) /vp{t). Note 
that when r < (3 and t < Tp + Sp, pF{t) /vp{t) < 1, and when r > j3 
and $7 G M*{a,b,Kp), by the last claim of Lemma 5.4, pF{t)/vp{t) < b~^. 
This says that pF{t)/vp{t) < C for some generic constant C > and so 



Bo{t) < C gi{t) / y F {t) . At the same time, by definitions and Lemma 5.2, 

F{t) = ho{t) + h+it) + h^it) + gi{t) > (1 - i^pep)[^'(t) + ep^rM + (7i(t), 
so we have 



Bo{t) < Cgi{t)/^^{t) + ep^r,{t)+gi{t). 

Finally, using Lemma 5.2 and noting that xj^/ x is an increasing function 
in a; £ (0, oo) for any number ^ > 0, we obtain 

„ ^ C[Kpep^ar,[t) + {Kpepf^(^^,y^{t) + [Kp^pf) 

Bo[t) < 



where the right hand side < I + II + C{Kpep)^^'^ , with 

CKpep^ar^jt) C{Kpepf^^,+a)r,{t) 
^^{t) + Kpep^ar,{t)' Jep^r.it) + KpEp^ ar,{t) + {KpEpY^ (i+a)r,{t) 



The above two terms have been considered in Lemma 5.6 (see the last two 
terms of (5.31)). Using the results over there we can show that 

(5.34) sup Bo{t) < Lpp-^f^/^ + Lpp-'^^f^''^'"'^ sup Wo{t). 

{0<t<Sp} {0<i<oo} 

Next we consider Si (t)^^ Write Bi{t) = 2-[{pF{t)/vp{t)y/^]-[h^ {t){F{t))~^/'^]. 
We have just proved pF{t)/vp{t) < C when r > f3 or < t < Tp + Sp 
with C > some generic constant. At the same time, using (5.17) and 
parts (a)-(b) of Lemma 5.2, first, h^{t) < ep^{t + Tp), and second, F{t) > 
ho{t) + hf{t) + h];{t) > (1 - Kpep)[^{t) + ep^rp{t)]. Combining these gives 

/i7(t)(F(i))-i/2 < Cep^it + Tp)/J^{t) + ep^rp{t). It follows that Bi{t) < 
Cep^{t+Tp) / ^ ^{t) + ep^rp{t). This together with direct calculations yields
(5.35) sup Bi{t)<Cep sup ^ ^^(t + Tp) ^ ^ (j^-jp+r) 



0<t<sp 0<f<oo ^^(i) + gp^^^ (i) 

Inserting (5.34) and (5.35) into (5.33) completes the proof. 
Consider the last two claims. Write a{t) = Ai ■ A2 ■ A^, where 

^ htit) + h^it)+giit) ^ / ^(t)+gp^,^(t)+gi(t) 1/2 
' ep^,^{t)+g,{t) ' ' I Fit) ' ' 

and As = {pF{t) /vpit))^'"^ . First, by Lemma 5.2 (part (b)), ep{l-Kpep)^rp{t) < 
^li't) + ^r(0 — ^p'^T-p(i) and thus 1 — KpGp < ^1 < 1. Second, simi- 
larly, by Lemma 5.2, 1 < ^2 ^ (1 — Kp€p)^^^^ . Since by basis algebra, 
\AB - 1| < \A - 1\ + \B - 1\ + \ A - l\\B - 1| for any numbers A and B, 
we have \a{t) - 1| < CKpep{l + [A^ - 1|) + |^3 - 1|. Now, by Lemma 5.4, 

1^3 -1| < i.p(p~"''°^'"'^'^^"")(^""'')^ + (l + i)exp(-^(=^)) whenr < /3 

and <t < Tp + Sp, and K'^^^ < A^^t) < ^^/^ when S7 G M*p{a,b,Kp), 
and so the claim follows. □ 



5.9. Proof of Lemma 5.4. The last claim follows trivially from the
assumption on the minimum eigenvalue of $\Omega$. In the case of $r \ge \beta$, by the
definition of $v_p(t)$ and noting that the maximum eigenvalue of $\Omega$ is bounded
by $K_p$, we obtain that $v_p(t) \le K_p p F(t)$. So we only need to prove the first
claim in the case of $r < \beta$ and the second claim.

Consider the first claim. Let Di = {j : ^ 0} and Di = Di \ 

{i}. Write h{t) = ho{t) + hi{t), where h{t) = P"^ ELi Ejezj, > 
t,\Z{j)\ > t), ho{t) = p-'Ei,jeD,P{\Z{i)\ > t,\Z{j)\ >t,fl^ = OoT fij = 

0), hit) = p-'Eij<,D,Pi\Z{i)\ > t,\Z{j)\ > t,jii / OandAj / 0). By 
definitions, it is seen that 

(5.36) Vp{t) = p{F{t) + rem{t)), where \rem{t)\ < h(t) = ho{t) + hi{t). 

To show the claim, it is sufficient to show that the ratio [/io(i) + hi{t)]/F{t) 
does not exceed the right hand side of (5.18). 

First, consider /io(i)- If at least one of Z{i) and Z{j) has mean 0, by 
Lemma 5.3 and definitions, P{\Z{i)\ > t, \Z{j)\ < t,flj = or jlj = 0) < 
CKp{l + t)exp(-^(=g^)(P(|Z(i)| >t) + P{\Z{i)\ > t)). Since A has at 
most Kp components, it follows from the definition of F{t) that

(5.37) 

ho{t)<cKp{i+t)eM-^^^^^)p^' E {p{\m\>t)+p{\m\>t)) 

<CK^p{l+t)e.p{-^^^)m. 

Next, consider hi{t). Define events = {fJ'ik) ^ for some k £ Di\ 
Dj}, ^2,ij = {m(^) 7^ for exactly one k, which is in Di n Dj}, and ^3,ij = 
7^ for two or more k, all of which are in Di Ci Dj}. It is seen that 

where = P"^ E^eA ^d^WI ^ > i,^MiU^iji), /ii,2W = 

P^'E^eD.^d^WI > > t,A2,»,) ,and/ii,3(t) = P~' EijeD, Pi\Z{i)\ > 

t,\Z{j)\ >t,As^ij). ^ 

We first consider /ii^i(t). Note that 

P{\Z{i)\ >t,\Z{j)\ >t,Ai,.,U^ij,) 

<P(|Z(i)| >t,|ZO-)l >^,^l,i^) + ^(I^WI >i,l^(j)l >i,^i,*j) 

< P{\Z{i)\ > t,Ai,,i)+P{\Z{j)\ > t,Ai,ij) < Kpep[P{\Z{i)\ > t) + P{\Z{j)\ > t)]. 

Thus, hi,i{t) < 2epK^pP~^YJl=iP{\Z{^\ >t) = LpepF{t). 

Now we consider hi^2{i)- For any G ^2,ij5 we use {Z* (i) , Z* (j)) to 
denote the demeaned pair of {Z{i),Z{j)). By definition there exists a /c 
such that ^Jn^^{k) = Tp, jx{i) = k)i_t{k) and = , k) fi{k) . Thus, 
|y^/i(z)| < aTp or |y^/"(j)I < O'^p and 

P(|Z(i)| > t, \Z{j)\ > t,A2,ij) < KpepPi\Z*{i)\ > t - aTp) = Kpep^ar.it). 
Then ^i,2(t) < Kpep^arpit)- Direct calculations yield 
(5.38) 

By Lemma 5.4 F{t) > ^{t)+€p^rpit), it follows that /ii,2(t) < Lp^j-(i-'^)(/^-'^'^)F(t). 
Now, consider /ii,3(t). Observe that /ii,3(t) < p ^ Yli jeo ^i^^'SAj) —
Kp{Kpep)^ . By Lemma 5.2, 

hijjt) ^ 1 KpjKpepf) ^ CKlel 
F{t) ~ 1 - Kpep ^{t) + ep'l',^(t) " ^-l*) + ep^r,{t) ' 

When r < ^ and t < Tp+Sp, we have ^{t)+ep^r^{t) > Lpp- max{4/3-2r,3/3+r}/2^ 
and thus CK^el/[^{t) + ep^rp{t)] < Lpip-^^-''^^ + p-'). When t > Tp + Sp, 
by the definition of Vp{t) and recahing that the largest eigenvalue of is 
bounded by Kp, we have Vp{t) < KppF{t). Combining these together and 
noting that F{t) > ^{t)+ep^rpit), we obtain hi^3{t)/F{t) < Kpift> Tp+Sp, 
and hi^s{t)/F{t) < Lp{p~(>^~^y^ + p-'} if t < Tp + Sp. 

Combining the bounds on /ii,2(i) and /ii,3(t) entails that when 

r < I3^hi{t)/F{t) < p-(/3-'-)/2 +p^r ^ p-{l-a){p-ar) if f < Tp + Sp and 

hi{t)/F{t) < Kp \i t > Tp + Sp. These together with (5.36) and (5.37) 
completes the proof of the first claim when r < (3. 

Next, we consider the second claim. The goal is to show that $v_p(t)/(pF(t)) \ge 1 - o(1)$,
assuming $r < \beta$ and $t < \tau_p + s_p$. We consider the cases (a) $d_3 \log\log p \le t < \tau_p + s_p$
and (b) $t < d_3 \log\log p$ separately, where $d_3 > 0$ is a large
constant.

In Case (a), using (5.37), it is seen that $|rem(t)|/F(t) = o(1)$, uniformly
for all $d_3 \log\log p \le t < \tau_p + s_p$. Using (5.36), $|v_p(t)/(pF(t)) - 1| = o(1)$ and
the claim follows.

In Case (b), recaU that Vp{t) = £;[(/if )'J^/xf ], where fif{j) = sgn(Z(j))l{|^(j)| > 
t} and Z = nZ. Write Z = + W, where /i = and ~ N{0, O). 

Let fit be the counterpart of fif defined by fit{j) = sgn(l^(j))l{|iy(j)| > 
t}. We claim (bl) E[{fifynfif] = EKfuYnfit] + 0{Lpp^~^/^) and (b2) 
EKfitYflfit] > pF{t). The claim follows by combining (bl) and (b2) and 
noting that pF{t) > Lpp{l — Kp€p) when t < loglogp. 

Consider (bl). Let S = {1 < i < p : /if (z) / Ai(^)}- Note that for 
all p X 1 vectors ^ and r], by Schwartz inequality and that the spectral 
norm of < Kp, |(^ + r?)'J7(e + rj) - r/'fir?! < + 2[{^'nC) ■ (r/'Or/)]^^ < 

-^p[ll?lP + IICIIII^II]- Applying this with r] = fit, C = ff - f-u and noting 
that each coordinate of fit — fit has magnitude no greater than 2, we claim 
that \E[{fil)'niif] - E[{fit)'^fH]\ < LpE[\S\ + < LpE[y^\]. Note 

that for any i £ S, we must have fi{i) 7^ 0. Therefore, by definitions, |5| < 

EUnmii) / 0} < ELiErm,,)^oHKj) / 0} < KpEUHK^) / 

0}, where we have used the assumption that is i^p-sparse. Note that 
Si=i / 0} ~ Binomial(p, ep), where ep = p~^, so £^[a/p|5'|] ~ p^~^/'^. 
Combining these gives (b1).

Consider (b2). Denoting B = E[fltfif], we have E[{flty0.flt] = E[Tlfitfi't] = 
tr{QB). We claim that for any i ^ j such that ^ 0, B(i,j) has the 

same sign as that of j). To see the point, write B{i,j) = E[sgn{Z {i))sgn{Z 
l{\Z{i\ > t,\Z{j)\ > i}. By symmetry and basic statistics, = 2[P{Z{j) 

t,Z{j) > t\n{i,j)) -P{Z{i) > t,Z{j) > t\- where for any p G 

(— 1, 1), P{Z{i) > t, Z{j) > t\p) is evaluated at the law that corr(Z(z), Z{j)) = 
p. The claim follows by noting that for any p > 0, P{Z{j) > t, Z{j) > 
t\p) > P{Z{i) > t)P{Z{j) > t) > P{Z{i) > t,Z{j) > t\- p). As a result, 
tr{QB) > tr[B) = pF{t), where we have used the fact that the diagonals of 
Q are ones. This proves (b2). □ 
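The sign claim used in step (b2), namely that $E[\mathrm{sgn}(W(i))\,\mathrm{sgn}(W(j))\,1\{|W(i)| > t, |W(j)| > t\}]$ has the same sign as the correlation of the pair, can be checked by simulation. The following is a minimal Monte Carlo sketch for a zero-mean bivariate normal pair with unit variances; the function name `signed_exceedance`, the sample size and the seed are hypothetical choices, and a simulation of this kind illustrates the claim without proving it.

```python
import numpy as np


def signed_exceedance(rho: float, t: float,
                      n_draws: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of E[sgn(W1) sgn(W2) 1{|W1| > t, |W2| > t}]
    for a zero-mean bivariate normal pair with unit variances and
    correlation rho."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    w = rng.multivariate_normal(np.zeros(2), cov, size=n_draws)
    both = (np.abs(w[:, 0]) > t) & (np.abs(w[:, 1]) > t)
    return float(np.mean(np.sign(w[:, 0]) * np.sign(w[:, 1]) * both))
```

With a positive correlation the estimate comes out positive and with a negative correlation it comes out negative, so each off-diagonal product $\Omega(i,j) B(i,j)$ is nonnegative, which is what gives $\mathrm{tr}(\Omega B) \ge \mathrm{tr}(B)$.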

5.10. Proof of Lemmas 2.3-2.4. Write for short $W(t) = p^{-1/2} HC(t, F)$.
Recalling $W_0(t) = [\epsilon_p \bar{\Psi}_{\tau_p}(t) + g_1(t)] / \sqrt{\bar{\Psi}(t) + \epsilon_p \bar{\Psi}_{\tau_p}(t) + g_1(t)}$ as defined in
(2.17), where $g_1(t)$ is as in Lemma 5.2, we let
$a_1(t) = (W_0(t))^{-1} [F(t) - h_0(t)] \cdot (F(t)(1 - F(t)))^{-1/2}$ and
$W_1(t) = [\bar{\Psi}(t) - h_0(t)] \cdot (F(t)(1 - F(t)))^{-1/2}$, where
$h_0(t)$ is as in Lemma 5.2. By these notations, $W(t) = a_1(t) W_0(t) - W_1(t)$.
The following lemma is proved in Section 5.11.

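Lemma 5.7. Fix a sufficiently large $p$. There is a universal constant
$C > 0$ such that for all $\Omega \in \mathcal{M}_p^*(a, K_p)$,

(5.39) $0 \le W_1(t) \le C K_p \epsilon_p \bar{\Psi}(t) / \sqrt{\bar{\Psi}(t) + \epsilon_p \bar{\Psi}_{\tau_p}(t)}$, for all $t \ge \bar{\Psi}^{-1}(1/2)$,

(5.40) $1 - C K_p \epsilon_p \le a_1(t) \le (1 + C K_p \epsilon_p)(1 - \bar{\Psi}(t) - K_p \epsilon_p)^{-1/2}$, for all $t > 0$.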
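Combining the definitions of $a_1(t)$, $W_0(t)$ and $W_1(t)$ above, the terms in $h_0(t)$ cancel and $W(t) = (F(t) - \bar{\Psi}(t)) / \sqrt{F(t)(1 - F(t))}$. An empirical counterpart replaces $F(t)$ by the fraction of scores exceeding $t$ in absolute value. The sketch below assumes that $\bar{\Psi}(t) = P(|N(0,1)| > t)$ and that the scores are already standardized; it is a generic illustration of the functional and of picking a threshold by maximizing it over a grid, not the paper's exact HCT procedure, and all function names are hypothetical.

```python
import math

import numpy as np


def two_sided_tail(t: float) -> float:
    """P(|N(0,1)| > t) for t >= 0 (assumed meaning of the null tail)."""
    return math.erfc(t / math.sqrt(2.0))


def w_functional(z: np.ndarray, t: float) -> float:
    """Empirical version of W(t) = (F(t) - tail(t)) / sqrt(F(t)(1 - F(t)))."""
    f_hat = float(np.mean(np.abs(z) > t))  # fraction of |z_j| exceeding t
    if f_hat <= 0.0 or f_hat >= 1.0:
        return float("nan")                # normalisation undefined
    return (f_hat - two_sided_tail(t)) / math.sqrt(f_hat * (1.0 - f_hat))


def hc_threshold(z: np.ndarray, grid: np.ndarray) -> float:
    """Grid point at which the empirical W(t) is largest."""
    values = np.array([w_functional(z, t) for t in grid])
    return float(grid[int(np.nanargmax(values))])
```

A typical call would pass a vector of $p$ scores and a grid such as `np.linspace(0.5, 4.0, 36)`. When no signal is present, the empirical $W(t)$ stays close to zero at moderate $t$, whereas a sparse set of shifted scores produces a clear positive peak.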

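Consider Lemma 2.3. Using Lemma 5.7, $|a_1(t) - 1| \le C(K_p \epsilon_p + \bar{\Psi}(t))$ for
all $t > 0$. Recalling $W(t) = a_1(t) W_0(t) - W_1(t)$, we have

(5.41) $\sup_{\{t \ge \bar{\Psi}^{-1}(1/2)\}} |W(t) - W_0(t)| \le \sup_{\{t > 0\}} \{|a_1(t) - 1| W_0(t)\} + \sup_{\{t \ge \bar{\Psi}^{-1}(1/2)\}} W_1(t) \le L_p (I + II + III)$,

where $I = K_p \epsilon_p \sup_{\{t > 0\}} \{W_0(t)\}$, $II = \sup_{\{t > 0\}} \{\bar{\Psi}(t) W_0(t)\}$, and
$III = \sup_{\{t \ge \bar{\Psi}^{-1}(1/2)\}} \{W_1(t)\}$.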
First, consider I. By basic algebra and Lemma 5.6, 



/ < Lpep[snp Wo{t) + snp\Wo{t)-Wo{t)\] < Lpp-P[p-^^'^ + sup {Wo{t)}\. 

{t>0} t>0 {t>Q} 
Next, consider $II$. Write

(5.42) $II \le \sup_{\{t>0\}}[\bar\Psi(t)W_0(t)] + \sup_{\{t>0\}}[\bar\Psi(t)|\tilde W_0(t) - W_0(t)|] = IIa + IIb$.

On one hand, elementary calculus shows that Ila < p^^ . On the other hand, 

by similar argument as in the proof of Lemma 5.6, lib < Lp{p~^ +p~~~^ -\- 

p^'^^l'^). Combining these, // < Lp{j>~^ ^ Last, consider 

///. By (5.39) and direct calculations, 

III < CKpep sup|i>o}{^'(t)/^ ^{t) + ep^it - r^)} < Lpp-^. 

Inserting these into (5.41) gives the claim. 

Next, we show Lemma 2.4. The first claim has already been proved in 
Lemma 5.6. So we only need to prove claims (a)-(c) in the case of $r > \beta$.

First consider claims (a) and (b) in Lemma 2.4. Comparing Lemma 5.6 
and the desired claim, it is sufficient to verify that 

(5.43) $\tilde W_0(t) \le p^{-\beta/2}/\sqrt{2}$, if $t \le \sqrt{2\beta\log p} - \Delta_1$ or $t \ge \tau_p$,

where $\Delta_1 = d_0(\log\log p)/\sqrt{\log p}$ is as defined in the statement of Lemma
2.4. Once this is proved, recalling that W{t) = ai{t)Wo{t) — Wi{t) and we 
have just proved supj>^-i(;^/2){^(*)^o(0} < Lpp~l^ , then by lemma 5.7 we 
have 

W{t) < a{t)Wo{t) < (1 + C^{t) + CKpep)WQ{t) < p-^/^/V2. 

We now proceed to prove (5.43). By the proof of Lemma 5.6 (inequality 
(5.31)), we have 

(5.44) < Woit) - Woit) < Lpp~^ + Kpep^ar,it)/^^it) + Kpep^ar.it), 

where we have noted that the last term in (5.31) is bounded by KpEp ^ ^(i+a)Tp it) ^ 

Lpp~^ . First consider the case when t < -v/2^51og^ — Ai. By Mills's ratio, 
for appropriately chosen do in Ai = do (log log p) / y/log p, we have ^'(t) + 
KpEp^arpit) > SK^Ep, and '^(t) + ep^arpit) > Sep. As a result, 

^ < y2i;/4, Woit) < ^ ^^^--^^^ < ^p/4. 

^^it) + Kpep^arpit) ^^it) + ep^rpit) 
Inserting these into (5.44), we complete the proof of (5.43) when $t \le \sqrt{2\beta\log p} - \Delta_1$. Now we consider the case of $t \ge \tau_p$. Since $\epsilon_p\bar\Psi_{a\tau_p}(t) = o(\epsilon_p p^{-(1-a)^2 r})$, it
follows that

W^Ziii!^ < jKpep^^r^it) = o{p-^/^) 



^{t) + Kpep'^aTpit) 



and 



Inserting these into (5.44) proves (5.43) when t > Tp. 



'^(t) + ep^rp{t) ^ ^ 



Finally we prove part (c). Write for short $s_p = \sqrt{2\beta\log p} - \Delta_1$. By (5.40) and recalling that we have just proved $\sup_{t>0} W_1(t) \le L_p p^{-\beta}$, we obtain that $W(t) = a_1(t)\tilde W_0(t) - W_1(t) \ge (1 - CK_p\epsilon_p)\tilde W_0(t) - \sup_{t>0} W_1(t) \ge (1 - CK_p\epsilon_p)\tilde W_0(t) - L_p p^{-\beta}$. Further recall that in Lemma 5.6, we have shown that $\tilde W_0(t) \ge W_0(t)$ for all $t > 0$. Thus, $W(t) \ge (1 - CK_p\epsilon_p)W_0(t) - L_p p^{-\beta}$. Taking $t^* = \frac{\beta + r}{2r}\tau_p$, it is seen that for sufficiently large $p$, $s_p < t^* < \tau_p$. Therefore, $\sup_{\{s_p < t < \tau_p\}} W(t) \ge (1 - CK_p\epsilon_p)\sup_{\{s_p < t < \tau_p\}} W_0(t) - L_p p^{-\beta} \ge (1 - CK_p\epsilon_p)W_0(t^*) - L_p p^{-\beta}$, and the first inequality of part (c) follows from $W_0(t^*) \sim p^{-\beta/2}$. On the other hand, by Lemma 5.6 and recalling $r > \beta$, we have $\sup_{t>0}\tilde W_0(t) \le L_p\sup_{t>0} W_0(t) \sim L_p p^{-\beta/2}$. Further, by (5.40) and the expression $W(t) = a_1(t)\tilde W_0(t) - W_1(t)$, we have $\sup_{\{s_p < t < \tau_p\}} W(t) \le \sup_{\{s_p < t < \tau_p\}}\{a_1(t)\tilde W_0(t)\} \le C\sup_{\{s_p < t < \tau_p\}}\tilde W_0(t) \sim L_p p^{-\beta/2}$. Thus, the second inequality in the claim follows. □

5.11. Proof of Lemma 5.7. Let $h_0(t)$, $h_1^{\pm}(t)$ and $g_1(t)$ be as in Lemma 5.4. Consider the first claim. By Lemma 5.2 parts (a) and (e), we have

(5.45) $0 \le \bar\Psi(t) - h_0(t) \le K_p\epsilon_p\bar\Psi(t)$, $\quad F(t) \ge (1 - K_p\epsilon_p)[\bar\Psi(t) + \epsilon_p\bar\Psi_{\tau_p}(t)]$.

At the same time, note that $F(t) \le \bar\Psi(t) + K_p\epsilon_p$. Combining these ensures that

(5.46) $1 \le (1 - F(t))^{-1/2} \le (1 - \bar\Psi(t) - K_p\epsilon_p)^{-1/2}$.

Inserting (5.45) and (5.46) into the definition of $W_1(t)$ gives

$0 \le W_1(t) \le \dfrac{K_p\epsilon_p\bar\Psi(t)}{\sqrt{(1 - \bar\Psi(t) - K_p\epsilon_p)(1 - K_p\epsilon_p)[\bar\Psi(t) + \epsilon_p\bar\Psi_{\tau_p}(t)]}}$.

Thus the first claim follows by noting $(1 - \bar\Psi(t) - K_p\epsilon_p) \ge 1/2 - K_p\epsilon_p$ for
all $t > \bar\Psi^{-1}(1/2)$.

Consider the second claim. Recall that $F(t) = h_0(t) + h_1^{+}(t) + h_1^{-}(t) + g_1(t)$.
By definitions,

(5.47) $a_1(t) = (1 - F(t))^{-1/2}\cdot I\cdot II$,

where $I = [h_1^{+}(t) + h_1^{-}(t) + g_1(t)]/[\epsilon_p\bar\Psi_{\tau_p}(t) + g_1(t)]$, and

$II = \sqrt{\bar\Psi(t) + \epsilon_p\bar\Psi_{\tau_p}(t) + g_1(t)}\big/\sqrt{h_0(t) + h_1^{+}(t) + h_1^{-}(t) + g_1(t)}$.

By (a) and (b) in Lemma 5.2, we have

(5.48) $(1 - K_p\epsilon_p) \le I \le 1$, $\quad 1 \le II \le (1 - K_p\epsilon_p)^{-1/2}$.

Inserting (5.46) and (5.48) into (5.47), we obtain that there is a universal
constant $C > 0$ such that (5.40) holds. □

5.12. Proof of Theorems 2.1-2.2. The following lemma is proved in Sec- 
tion 5.13. 

Lemma 5.8. Fix $(\beta, r) \in (0,1)^2$ and a sufficiently large $p$. When $t$ ranges
in $(0, \infty)$, $W_0(t)$ first strictly increases and reaches the maximum at $t = t_p^{**} \sim \min\{2, \frac{r+\beta}{2r}\}\tau_p$ $(= t^*)$, and then strictly decreases. Additionally, if
$r < \beta$, then there are positive constants $c_4 = c_4(\beta,r)$ and $c_5 = c_5(\beta,r)$ such
that for all $|t - t_p^{**}| \le c_4\tau_p^{-1}$, $W_0''(t) \le -2c_5W_0(t)$.
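
The shape described in Lemma 5.8 can be inspected numerically. The sketch below is only an illustration under stated assumptions: it takes $W_0(t) = \epsilon_p\bar\Psi_{\tau_p}(t)/\sqrt{\bar\Psi(t) + \epsilon_p\bar\Psi_{\tau_p}(t)}$ with $\bar\Psi_{\tau}(t) = P(|N(\tau,1)| > t)$, and the values $p = 10^6$, $\beta = 0.6$, $r = 0.15$ and the grid step are arbitrary choices. At such a moderate $p$ the grid maximizer is expected to sit noticeably below the asymptotic location $\min\{2, (r+\beta)/(2r)\}\tau_p$, since the lemma is a large-$p$ statement.

```python
import math

def sf(x):
    """Standard normal survival function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def psi_bar(t, tau=0.0):
    """P(|N(tau, 1)| > t) for t >= 0."""
    return sf(t - tau) + sf(t + tau)

def w0(t, eps, tau):
    """Assumed form of W_0(t): signal part over the square root of the total."""
    signal = eps * psi_bar(t, tau)
    return signal / math.sqrt(psi_bar(t) + signal)

p, beta, r = 1e6, 0.6, 0.15           # illustrative values with r < beta / 3
eps = p ** (-beta)
tau = math.sqrt(2.0 * r * math.log(p))

step = 0.001
grid = [k * step for k in range(1, int(3.0 * tau / step))]
t_max = max(grid, key=lambda t: w0(t, eps, tau))   # grid maximizer of W_0
t_asym = min(2.0, (r + beta) / (2.0 * r)) * tau    # asymptotic location
```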

Denote by $W(t) = p^{-1/2}HC(t, F)$. By the first claim in Lemma 2.3 and
Lemma 5.6, and noting that $\beta > c_0(\beta,r,a)$, we obtain

(5.49) $\sup_{\{t>0\}}|W(t) - W_0(t)| \le L_p[p^{-\beta} + p^{-c_0(\beta,r,a)}\sup_{\{t>0\}} W_0(t)]$.

First, we show Theorem 2.1, where we assume $r < \beta$. Once the first claim
is proved, the second claim follows by combining Taylor expansion with
Lemmas 2.3, 2.4, and 5.8, so we only show the first claim. The idea is to
prove that $T_{HC}$ and $T_{ideal}$ are both close to $t_p^{**}$; then they are close to each other.

We first prove that $T_{HC}$ and $t_p^{**}$ are close. We will show that (i) $W(t_p^{**} + u) - W(t_p^{**}) < 0$ for all $|u| \le c_4/\tau_p$, and (ii) $W(t) - W(t_p^{**}) < 0$ for all $t = t_p^{**} + u$ with $|u| > c_4/\tau_p$. Then combining these proves

(5.50) $|T_{HC}(F) - t_p^{**}| \le p^{-c_1}$,

with $c_1 = c_1(\beta,r,a) > 0$ some constant to be specified later.

We now prove the first case (i). Recall that $t_p^{**}$ is the maximizer of $W_0(t)$
and $W_0(t_p^{**}) = L_p p^{-\delta(\beta,r)}$, where $\delta(\beta,r)$ is as in (2.10). Thus, $W_0'(t_p^{**}) = 0$. By Taylor expansion, $W_0(t_p^{**} + u) - W_0(t_p^{**}) = \frac{u^2}{2}W_0''(\xi_p)$, where $\xi_p$ lies between $t_p^{**}$ and $t_p^{**} + u$. Next, by Lemma 5.8, for $|u| \le c_4/\tau_p$ we can further write

$W_0(t_p^{**} + u) - W_0(t_p^{**}) \le -c_5u^2W_0(\xi_p) = -c_5u^2W_0(t_p^{**}) - c_5u^2(W_0(\xi_p) - W_0(t_p^{**})) \le -c_5u^2W_0(t_p^{**}) - c_5u^2(W_0(t_p^{**} + u) - W_0(t_p^{**}))$,

where the last step is because $W_0(t_p^{**} + u) \le W_0(\xi_p)$. Thus, the inequality can be further
written as $W_0(t_p^{**} + u) - W_0(t_p^{**}) \le -c_5u^2W_0(t_p^{**})/(1 + c_5u^2)$. Then by (5.49)
we obtain that

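(5.51) $W(t_p^{**} + u) - W(t_p^{**}) = (W(t_p^{**} + u) - W_0(t_p^{**} + u)) - (W(t_p^{**}) - W_0(t_p^{**})) + (W_0(t_p^{**} + u) - W_0(t_p^{**}))$

$\le L_p(p^{-\beta} + p^{-c_0(\beta,r,a)}W_0(t_p^{**})) + (W_0(t_p^{**} + u) - W_0(t_p^{**}))$

$\le L_p p^{-\beta} + (L_p p^{-c_0(\beta,r,a)} - c_5u^2/(1 + c_5u^2))W_0(t_p^{**})$.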

It is easy to check that p-'=o(^'''''')Wo(t;*) > Lpp'^ when p*q{I3) < r < 
p. By Lemma 5.8, we obtain that if |n| > p^'^^ with ci = ci(/5, r, a) G 
(0, |co(/3, r, a)), then for all \u\ < c^/Tp, 

Wifp* +u)- W{tl*) < -Lpp-2'=i(^'^''^)T^?o(r)(l + o(l)) < 0, 

which completes the proof of case (i). It remains to prove case (ii). Di- 
rect calculations yield Wo(i** ± Ci/Tp) < e~'^^WQ{t*p*), where C5 > is 
a constant depending on whether r < /3/3 or r > /3/3. By Lemma 5.8, 
Wo{t) < Wo{t*p* ± C4/Tp) < e-^'^Woit;*) for all \t - > Ci/Tp. Thus, simi- 
lar to (5.51) we have T^(t)-W^(t;*) < Lp{p-^ +p-''o^^''^''''^Wo{t*p*)) + {Wo{t)- 
Wo{t;*)) < Lpp-P + (e-^5 _ 1 + ^^p-co(/3,r,a))^^(^**) ^ j^^p-p + (g-C5 _ ^ + 

l^p-co{p,r,a)^^-S{P,r) ^ yfj^arfi the last step is because /? > (5(^,r). This 
proves case (ii). Consequently, we have proved (5.50). 

Using a similar method as above and in view of Lemma 2.1, we can also
prove that for appropriately chosen $c_1 > 0$,

(5.52) $|T_{ideal}(\epsilon_p, \tau_p, \Omega) - t_p^{**}| \le p^{-c_1}$.

Thus the claim in Theorem 2.1 follows when $r < \beta$.

We now show Theorem 2.2, where we assume $r > \beta$. In this range $W_0(t)$
is maximized at $t_p^{**} = \frac{\beta+r}{2r}\tau_p$ and $W_0(t_p^{**}) \sim p^{-\beta/2}$. By Lemma 2.4 we see that
the maximizer of $W_0(t)$ is in the range $[\sqrt{2\beta\log p} - \Delta_1, \tau_p)$. By (5.43) and
Lemma 2.3 we obtain that if $0 < t < \sqrt{2\beta\log p} - \Delta_1$ or $\tau_p \le t < \infty$,

W{t) = Wo{t) + {W{t) - Wo{t)) < -^p-^/^ + Lpp-^ = ^V~^'\^ + 0(1)), 

and if V2^1ogp - Ai < t < Tp, 

Wit) = W^{t) + {W{t) - t^o(t)) > V-^'^ - L.pp-^ = p-^'\l - 0(1)). 

Thus, the maximizer $T_{HC}(F)$ is in the interval $[\sqrt{2\beta\log p} - \Delta_1, \tau_p)$.

By Lemma 2.2, the maximizer of $Sep(t, \epsilon_p, \tau_p, \Omega)$ is in the interval $[\sqrt{2\beta\log p} - \Delta_1, \tau_p + \Delta_2)$. Thus, Theorem 2.2 follows immediately from Lemma 2.2.

□ 

5.13. Proof of Lemma 5.8. Let $\psi_{\tau_p}(t) = \phi(t - \tau_p) + \phi(t + \tau_p)$ and $\psi(t) = 2\phi(t)$. Introduce $m_0(t) = \bar\Psi(t)/\psi(t)$, $m_1(t) = \bar\Psi_{\tau_p}(t)/\psi_{\tau_p}(t)$, $d(t) = -\psi_{\tau_p}'(t)/\psi_{\tau_p}(t)$,
$a(t) = \epsilon_p\psi_{\tau_p}(t)/\psi(t)$, $R(t) = m_1(t)/m_0(t)$, and $g(t) = (1/2)(1 + a(t))/(R^{-1}(t) + a(t))$. The following lemma is proved in Section 6.4.

Lemma 5.9. Fix a sufficiently large $p$. Then $R(t) > 1$ and $R(t)$ is strictly decreasing
for all $t > 0$.

Consider the first claim. By direct calculations and our notations. 

To show the claim, it suffices to show that the equation $g(t) = 1$ has exactly one
solution. Recall that $g(t) = (1/2)(1 + a(t))/(R^{-1}(t) + a(t))$, where $R(t) > 1$
and both $a(t)$ and $R^{-1}(t)$ are strictly increasing in $t$. It follows from basic
calculus that $g(t)$ is strictly decreasing in $(0, \infty)$, and the equation $g(t) = 1$
has at most one solution.

The equation also has at least one solution. Note that g{0) > Ce'^p^'^ 
which > 1 for sufficiently large p, it suffices to show that there is a t such 
that g{t) < 1. We show this for the case of r < (3/3 and r > /3/3 separately. 
In the first case, for all t such that \t — 2Tp\ < 4:Tp^, a{t) is algebraically 
small, and so by Mills' ratio [41], for any fixed 5, 

ff(2Tp + feTp-i) < ^ - f + 0{t-' 



2 2 



and the claim follows. Note that this shows that the solution $t_p^{**}$ of the equation $g(t) = 1$ satisfies $|t_p^{**} - 2\tau_p| \le 2\tau_p^{-1}$. In the second case, $a(\sqrt{2\log p}) = L_p p^{1 - \beta - (1 - \sqrt{r})^2}$, where the exponent is $> 0$ since $r > \beta/3$ and $r > \rho^*(\beta)$
(recall that $\rho^*(\beta)$ is the standard phase function). Therefore, $g(\sqrt{2\log p}) \sim 1/2$ and
the claim follows. This completes the proof of the first claim.

Consider the second claim. We discuss the case $0 < r < \beta/3$ and the case
$\beta/3 < r < \beta$ separately.

Consider the first case. Recalling that $|t_p^{**} - 2\tau_p| \le 2\tau_p^{-1}$, it is sufficient
to show that for all $t$ such that $|t - 2\tau_p| \le 4\tau_p^{-1}$, $W_0''(t)/W_0(t) \le -1/2$.
Introduce $s(t) = [t\psi(t) + \epsilon_p d(t)\psi_{\tau_p}(t)]\cdot[\bar\Psi(t) + \epsilon_p\bar\Psi_{\tau_p}(t)]/[\psi(t) + \epsilon_p\psi_{\tau_p}(t)]^2$. By
direct calculations,

(5.54) W"/W{t) = 1 + 11- hll, 
where 

/ = {g{t)-lf/ml{t), II = d{t)/mi{t)-m^\t), III = {s{t)-l)g\t)m^\t). 

Consider / first. When \t — 2rp| < 4:Tp^, on one hand, by Mills' ratio, 
m^^{t) ~ (t — Tp) ~ Tp. On the other hand, by similar argument, \g{t) — 
1| < C't"^. It follows that / < CTp"^. Consider // next. By Mills' ratio, 
m^^(i) = {t — Tp) + + 0{Tp ^). Since \d{t) — {t — Tp)\ is algebraically 
small, it follows from basic algebra that 1/ ~ —1. Consider ///. Note that 
both the ratio epipTpit)/ijit) and the ratio ep^ Tj,{t) / ^ Tp{t) are algebraically 
small. Combining this with ^{t)/ip{t) = (1/t) - (1/t^) + 0{t~^) gives 

(^(^))2 ) = l-^ + 0(Tp ), 

Recall that m^^{t) ~ Tp and g{t) ~ 1, it follows that /// ~ — 4Tp/t^ ~ — 1. 

Inserting these into (5.54) gives that for all $|t - 2\tau_p| \le 4\tau_p^{-1}$, $W_0''(t)/W_0(t) \le -1/2$, and the second claim follows.

Consider the second case, where $\beta/3 < r < \beta$. For a constant $\eta_0 \in (0, 1)$ to be
determined, choose Iq and such that a(to) = ai^d a{t^) = (1 ± 

r]Q)a{to). It is seen that \t^ — ^^Tp\ < Ct'^, and |to — '^f'^pI — ^'^p^- 
Combining these with definitions and Mills' ratio, for t~ < t <tp, R~^{t) ~ 
{t - Tp)/t ~ (/? - r)/{l3 + r), and that 

, , 1 1 + a(t) 

(5.55) g{t) ~ - ^ ^' 



2 [(/3-r)/(/3 + r)] + a(t)' 



By direct calculations, g(tp) > 1 and g{tp) < 1. Since g{t**) = 1, we have 



Op ^ bp ^ bp . 
We now use (5.54) to calculate $W_0''(t)/W_0(t)$. First, recall that $II \sim -1$. Second, by similar argument, $m_1^{-1}(t) \sim (t - \tau_p) \sim \frac{\beta - r}{2r}\tau_p$. Combining this with (5.55),

. = m,Hi)m - If = - 11^ +o,l)). 

Last, by similar argument. 



t^Pit) + epd{t)i^r,{t) (/3 + r)/(/3 - r) + a{t) ,p-r ^ 



.2 



S{t) 



(/3 + r)/(/3 - r) + a(t)] • [(/3 - r)/(/3 + r) + a(0] 



(l + a(t))^ 



Combining this with (5.55), /// equals to (^)^r^ • [j^pz-rf/l^^^y^Y times 
[(/3 + r)/(/3 - r) + a(t)] • [(/3 - r)/(/3 + r) + a(t)] 

[ ot^Mp 

Inserting these into (5.54) and recalling that a(to) = (Sr — /5)/(/3 + r), it 
follows from basic algebra that W"{tQ)/W{to) < ■ {^fr'^, Recall 

that a(tp ) = (1 ± ryo)a(to)- By the continuity of / and /// on a{t), if we 
choose ?7o sufficiently small, then for all t~ < t < tp, 



<(i)/^o(t) < 



3r-/3 ,P-r 2 2 



4(/3-r) ' 2r ' 

and the claim follows. □ 

5.14. Proof of Lemma 3.1. The following lemma is proved in Section 
5.15. 

Lemma 5.10. As p — )• 00, there is a constant C > such that with 
probability at least 1 — o(l/p^), 

^iFpjt) - F{t)\ ^ f CK^{log{p)f/^ VO < t < v /2M^ , f > pF(t) > log^/^p), 
^F(t)(l-F(t)) " I CKl{log{p)y'/\ VO < t < y2Mp),pF(t) < log5/4(p). 

We now prove Lemma 3.1. Put an evenly spaced grid on $[0, \sqrt{2\log p}]$ by
$t_k = (\sqrt{2\log p}/p^2)k$, $0 \le k \le p^2$. Denote by $V(t) = \sqrt{p}(F_p(t) - F(t))(F(t)(1 - F(t)))^{-1/2}$. For each $0 \le i \le p^2 - 1$, we claim that

(5.56) $\sup_{\{t_i \le t \le t_{i+1}\}}|V(t)| \le \max\{|V(t_i)|, |V(t_{i+1})|\} + L_p/p$.
In fact, as both $F_p(t)$ and $F(t)$ are monotone functions, we have
Fpjtj+i) -FjU) ^ Fpjt) - F{t) ^ FpiU) - F{U+i) 



Let hi 



!F{ti) yjF{t) yjF{U+^) 

^^^'"^'^ Since F{t) < i, sup|j^<j<j^^j{|y(i)|} does not exceed 



(5-57) 

2(^max{W —\V{ti)\, Vhi\V{ti+i)\}+-^ ^= h ^ 



F{ti 



Fit 



i+l, 



The derivative of $-F(t)$ is the density of a location normal mixture,
and is therefore bounded from above. Moreover, for $0 < t < \sqrt{2\log p}$ and
sufficiently large $p$, $F(t) \ge F(\sqrt{2\log p}) \ge 2(1 - K_p\epsilon_p)\bar\Phi(\sqrt{2\log p}) \ge p^{-1}L_p$.
Using Taylor expansion,
(5.58) 



^\F{U) -F(^,+i)| ^ ^\F{U) - F{U+i)\ ^ 



Lr 



F{ti+i) 



--+ 



Lr 



p^F{ti) Jp^F{ti+i 



< Lp/p. 



Similarly, we can show — 1| < Lp/p. Inserting this and (5.58) into (5.57) 
gives (5.56). 

Combining (5.56) with Lemma 5.10, the claim follows from 



sup 



{0<t<V21og(p)} 



^\Fp{t) - F{t)\ 



F{t){l-F{t)) 



sup \V{t)\ < C sup \V{ti)\ + ^, 

{0<j<p2} P 



{0<t<V21og(p)} 



where $C > 0$ is some constant.



□ 



5.15. Proof of Lemma 5.10. The following lemma is proved in Section 
5.16. 

Lemma 5.11. There are partitions {1, 2, . . . ,p} = i?'^ U i?2 • • • U ^'ni — 
R'l U i?2 . . . U R'l^^ such that Ni < CKp\og{p), N2 < CK^log{p), and that 
for any fixed I < j < Ni and 1 < k < N2, the collection of random vari- 
ables {Z (i) — fi{i) , i £ R'j} are independent of each other, and the same are 

We now show Lemma 5.10. The key idea is to combine Lemma 1.1 with
the well-known Bennett's inequality (e.g., [37]). Bennett's inequality
only applies to sums of independent random variables. To apply it in the
current setting, note that by Lemma 5.11, we can partition $\{1, 2, \ldots, p\}$
into $N$ different subsets $R_1, \ldots, R_N$, where $N \le CK_p^3\log^2(p)$, such that
the collection of random variables $\{Z(i) : i \in R_k\}$ are independent, for
each $1 \le k \le N$. In light of this, we write $F_p(t) = \frac{1}{p}\sum_{k=1}^{N} S_p^{(k)}(t)$, where
$S_p^{(k)}(t) = \sum_{i \in R_k} 1\{|Z(i)| \ge t\}$ is the sum of independent random variables,
to which Bennett's inequality can be applied directly.

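In detail, let $S^{(k)}(t) = E[S_p^{(k)}(t)]$ and $s_k = |R_k|$, $1 \le k \le N$, and $S(t) = \sum_{k=1}^{N} S^{(k)}(t)$. Since we are only interested in the region of $t$ such that $F(t) \le 1/2$, it follows easily that

(5.59) $\dfrac{\sqrt{p}|F_p(t) - F(t)|}{\sqrt{F(t)(1 - F(t))}} \le \dfrac{\sqrt{2}|S_p(t) - S(t)|}{\sqrt{S(t)}} \le \sum_{k=1}^{N}\dfrac{\sqrt{2}|S_p^{(k)}(t) - S^{(k)}(t)|}{\sqrt{S(t)}}$.

For each $1 \le k \le N$, using Bennett's inequality [37, Page 851] yields

(5.60) $P(|S_p^{(k)}(t) - S^{(k)}(t)| \ge \lambda) \le 2\exp\Big(-\dfrac{\lambda^2}{2s_k\sigma_k^2}\psi\Big(\dfrac{\lambda}{s_k\sigma_k^2}\Big)\Big)$,

where $\psi$ is as in [37, Page 851] and $s_k\sigma_k^2 = \mathrm{var}(S_p^{(k)}(t))$. First, note that
$x\psi(x)$ is monotonically increasing in $x \in (0, \infty)$. Second, by definitions and the basic
property of Bernoulli random variables, $s_k\sigma_k^2 \le S^{(k)}(t) \le S(t)$. Inserting
these into (5.60) gives

$P(|S_p^{(k)}(t) - S^{(k)}(t)| \ge \lambda) \le 2\exp\Big(-\dfrac{\lambda^2}{2S(t)}\psi\Big(\dfrac{\lambda}{S(t)}\Big)\Big)$.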

Let A = Cy/{logp)S{t) if S{t) > i(logp)5/4 and A = C(logp)3/2 ff S{t) < 
^(logp)^/^, where C > is a constant. By elementary calculus and the 
property of tp, 

V / [ exp( f^), 5(t) < 2(logp)°/*. 

Inserting this into (5.59) and noting that pF{t) > (\ogp)~^^'^ give the claim. 
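
For readers who wish to experiment with the bound, the following Python sketch evaluates a Bennett-type tail bound. It assumes the standard form $\psi(x) = 2[(1+x)\log(1+x) - x]/x^2$, which has the properties used in the proofs ($\psi(0+) = 1$, $x\psi(x)$ increasing, $\psi(x) \sim 2\log(x)/x$ as $x \to \infty$); the function names and the Bernoulli block of size 500 are hypothetical choices for illustration.

```python
import math

def bennett_psi(x):
    """psi(x) = 2 * ((1 + x) * log(1 + x) - x) / x**2 for x > 0 (assumed form)."""
    if x < 1e-4:
        # Series expansion 1 - x/3 + x^2/6 avoids cancellation near zero.
        return 1.0 - x / 3.0 + x * x / 6.0
    return 2.0 * ((1.0 + x) * math.log1p(x) - x) / (x * x)

def bennett_bound(lam, variance, b=1.0):
    """Two-sided bound on P(|S - E[S]| >= lam).

    S is a sum of independent terms with |X_i - E[X_i]| <= b and Var(S) = variance.
    """
    exponent = lam * lam / (2.0 * variance) * bennett_psi(lam * b / variance)
    return min(1.0, 2.0 * math.exp(-exponent))

# Hypothetical block: 500 independent Bernoulli(0.02) exceedance indicators.
size, q = 500, 0.02
variance = size * q * (1.0 - q)
bounds = [bennett_bound(lam, variance) for lam in (5.0, 10.0, 20.0)]
```

Because $x\psi(x)$ is increasing, replacing the variance by any upper bound (as is done with $S(t)$ in the proof) can only enlarge the right-hand side.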

□ 

5.16. Proof of Lemma 5.11. Recall that $Z - \mu \sim N(0, \Omega)$; the first claim
follows directly from Lemma 1.1. For the second claim, introduce a graph
$G = (V, E)$ where $V = \{1, 2, \ldots, p\}$, and nodes $i$ and $j$ are connected if and
only if $S_i \cap S_j \neq \emptyset$, where $S_i = \{1 \le k \le p : \Omega(i,k) \neq 0\}$, $1 \le i \le p$. Since
$\Omega$ is $K_p$-sparse, $G$ is $K_p^2$-sparse. Also, $\tilde\mu(i)$ and $\tilde\mu(j)$ are independent if and
only if nodes $i$ and $j$ are disconnected. Applying Lemma 1.1 to $G$ gives the
claim. □
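
The partition idea can be made concrete with a small sketch. The code below is not the construction behind Lemma 1.1; it swaps in plain greedy graph colouring, which splits the nodes of a graph with maximum degree $d$ into at most $d + 1$ blocks containing no edge. The tridiagonal support pattern and all function names are illustrative assumptions.

```python
def overlap_graph(supports):
    """Connect i and j (i != j) whenever their supports S_i, S_j share an index."""
    nodes = sorted(supports)
    adjacency = {i: set() for i in nodes}
    for a in nodes:
        for b in nodes:
            if a < b and supports[a] & supports[b]:
                adjacency[a].add(b)
                adjacency[b].add(a)
    return adjacency

def greedy_partition(adjacency):
    """Greedy colouring: returns blocks of nodes with no edge inside a block."""
    colour = {}
    for node in sorted(adjacency):
        used = {colour[nb] for nb in adjacency[node] if nb in colour}
        c = 0
        while c in used:
            c += 1
        colour[node] = c
    blocks = {}
    for node, c in colour.items():
        blocks.setdefault(c, []).append(node)
    return [blocks[c] for c in sorted(blocks)]

# Hypothetical tridiagonal precision matrix on p = 12 coordinates:
# S_i = {i - 1, i, i + 1}, so i and j are connected iff 0 < |i - j| <= 2.
p = 12
supports = {i: {k for k in (i - 1, i, i + 1) if 0 <= k < p} for i in range(p)}
graph = overlap_graph(supports)
blocks = greedy_partition(graph)
```

Within each block the supports are pairwise disjoint, which is the property the proof uses to obtain independence.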
5.17. Proof of Lemma 3.3. Recall that $n_p = p^{\theta}$, $\tilde Z = \Omega Z$, and $\hat Z = \hat\Omega Z$. A
direct result of Lemma 3.2 is that there is a term $0 < \eta_p \le CK_p(\log p)p^{-\theta/2}$
such that with probability at least $1 - o(1/p)$,

$|1\{|\hat Z(j)| > t\} - 1\{|\tilde Z(j)| > t\}| \le 1\{t - \eta_p \le |\tilde Z(j)| \le t + \eta_p\}$, for all $t > 0$ and $1 \le j \le p$.

Let $G_p(t) = F_p(t - \eta_p) - F_p(t + \eta_p)$ and $G(t) = F(t - \eta_p) - F(t + \eta_p)$. By
the above inequality, it is seen that with probability at least $1 - o(1/p)$,

(5.61) $|\hat F_p(t) - F_p(t)| \le G_p(t)$.

We now analyze $G_p(t)$. By definitions and the triangle inequality,

(5.62) $G_p(t) \le G(t) + |F_p(t - \eta_p) - F(t - \eta_p)| + |F_p(t + \eta_p) - F(t + \eta_p)|$.

A key fact is that there is a universal constant $C > 0$ such that

(5.63) $|F'(t)| \le C(K_p\tau_p + t)F(t)$.

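To see the point, we write $F(t) = \frac{1}{p}\sum_{i=1}^{p} E[\bar\Phi(t - \sqrt{n_p}\tilde\mu(i)) + \bar\Phi(t + \sqrt{n_p}\tilde\mu(i))]$ and $F'(t) = -\frac{1}{p}\sum_{i=1}^{p} E[\phi(t - \sqrt{n_p}\tilde\mu(i)) + \phi(t + \sqrt{n_p}\tilde\mu(i))]$, where $\phi$ is the density function of $N(0, 1)$.
Noting that there is a constant $C > 0$ such that $\phi(x) \le C|x|\bar\Phi(x)$, and that
$|t \pm \sqrt{n_p}\tilde\mu(i)| \le t + K_p\tau_p$ for all $1 \le i \le p$, the desired claim follows.

Now, first, write $G(t) = F(t - \eta_p) - F(t + \eta_p) = 2\eta_p|F'(\xi)|$ for some number $\xi$
with $|\xi - t| \le \eta_p$. Using (5.63), $|F'(\xi)| \le CK_p\tau_pF(\xi) \sim CK_p\tau_pF(t)$. It follows

(5.64) $G(t) \le CK_p\tau_pF(t)\eta_p$.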

Second, by Lemma 3.1 and monotonicity, with probability at least 1 — o(l/p), 
\Fp{t±rjp)-F{t±r]p)\ < CK^{logpy/^p-^/^{F{t±r]p))^/^ , whereby (5.63), 

F(t lb rjp) X F{t). It follows that with probability at least 1 — o{l/p), 

(5.65) \Fp{t ± Tjp) - F{t ±r]p)\< CKl{\ogpfp~^'\F{t)f'\ 
Recall that rjp < Kp{logp)p~^^'^ . Inserting (5.64)-(5.65) into (5.62) gives 

(5.66) Gp{t) < CK^{\ogpf'^p-"^F{t) + GKl{\ogpfp-^/\F{t)Y/\ 
Combining (5.66) with (5.61) gives 

(5.67) 

- ""^^'^^ < . < C[K^p{logpf/\p'-^mf" + Kli\ogpY), 

^F{t){l-F{t)) ^F{t){l-F{t)) 

and the claim follows. □ 
5.18. Proof of Theorem 3.1. We consider the case when pF(t) < K^{\og(jp)Y
and when > Kp(log(p))^ separately. 

In the first case, it is sufficient to show that \HC{t^Fp)\ < Lp and 
\HC{t, Fp)\ < Lp. By Lemmas 3.3 and 5.10, with probabihty at least 1 — 
o{l/p),_p\Fp{t) - F{t)\ < Lp. By Lemma 5.4, F{t) > (1 - Kpe_p)^{t) and 
thus, p^{t) < Lp. Since HC{t, Fp) is defined in a way such that Fp{t) > 1/p, 
it is easy to see that HC{t,Fp) < p\Fp{t) - ^{t)\ < p\Fp{t) - F{t)\ +pF{t) + 
p^{t) < Lp. Similarly, we can prove that HC{t,Fp) < Lp. The claim follows 
easily. 

In the second case, let h{t) = {F{t){l- F{t)))/{Fp{t){l- Fp{t)) and write 

for short g{t) = ^{Fp{t) - Fp{t)) {F{t){l - By definitions, we 

can write 

(5.68) HC{t, Fp) - HC{t, F) = g(t)y^ + HC{t, F){y^ - 1). 

We first prove \h{t) — 1| < o(l). To see this, note that (5.67) and Lemma 
5.10 ensure that with probability at least 1 — o{l/p), 

(5.69) \Fp{t)/F{t) - 1| < liFpit) - Fp{t))/F{t)\ + \Fp{t)/F{t) - 1| 

< CK*p{\ogpf'^p-'/^ + CKl{\ogp)\pF{t))-^l\ 

By the assumption oipF{t) > -fCp(logp)^, the right hand side of (5.69) tends 
to 0. Thus, with probability at least 1 -o(l/p), < Fp{t),F{t) < 2/3 for all 
0<t< Y^21og(p). Note that for all X, y G (0,2/3), \[x{l-x)]/[y{l-y)]-l\ < 
C\x/y — 1|. It follows from (5.69) and definitions that 

(5.70) \h{t) - 1| < C\Fp{t)/F{t) - 1| < Lp{p-'/^ + {pF{t))-'/'), 

where the right hand side tends to since pF{t) > Kp{logp)^ . At the same 
time, since F{t) > (1 - Kpep)^{t), we have \F{t) - ^{t)\ < F{t) + ^{t) < 
2F{t). It follows from 1 - F{t) > 1 - ^{t) - KpCp > 1/2 - Kpep that 

(5.71) |/7C(i,F)| = Vp|F(t)-^'(t)|(nO(l-i^W)"'/' <C(pFW)'/'- 
Combining (5.70) and (5.71) gives 

(5.72) HC{t,F)\^)-l\<Lp[{p^-'F{t)f'^ + l]. 

At the same time, a direct use of Lemma 3.3 also gives that with probability 
at least 1 — o(l/p), 

(5.73) g{t)<Lp[{p^~'F{t)fl^ + l]. 

Inserting (5.72) and (5.73) into (5.68) and recalling \h{t) — 1| — )• gives the 
claim. □ 
5.19. Proof of Theorem 3.2. Write for short $W_p(t) = p^{-1/2}HC(t, \hat F_p)$
and $W(t) = p^{-1/2}HC(t, F)$.
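
To make the objects in this proof concrete, here is a minimal Python sketch of an empirical HC functional and its maximizing threshold. It is a simplified illustration under explicit assumptions, not the paper's procedure: coordinates are simulated as independent (so no precision matrix is involved), the empirical survival function is evaluated only at the observed $|z|$ values with $F_p(t) \le 1/2$, and the sample size, signal count and signal strength are arbitrary.

```python
import math
import random

def two_sided_sf(t):
    """P(|N(0,1)| > t)."""
    return math.erfc(t / math.sqrt(2.0))

def hc_threshold(z):
    """Return (t_hc, hc_max) maximizing
    sqrt(p) * (F_p(t) - Psi_bar(t)) / sqrt(F_p(t) * (1 - F_p(t)))
    over t in the observed |z| values, restricted to 1/p <= F_p(t) <= 1/2."""
    p = len(z)
    abs_sorted = sorted((abs(v) for v in z), reverse=True)
    best_t, best_hc = None, -math.inf
    for k in range(1, p // 2 + 1):
        t = abs_sorted[k - 1]          # exactly k coordinates have |z| >= t
        f_emp = k / p
        hc = math.sqrt(p) * (f_emp - two_sided_sf(t)) / math.sqrt(f_emp * (1.0 - f_emp))
        if hc > best_hc:
            best_t, best_hc = t, hc
    return best_t, best_hc

random.seed(0)
p, n_signal, shift = 2000, 40, 3.5     # illustrative values only
z = [random.gauss(0.0, 1.0) for _ in range(p)]
for j in range(n_signal):
    z[j] += shift                      # a sparse set of shifted coordinates
t_hc, hc_max = hc_threshold(z)
selected = [j for j, v in enumerate(z) if abs(v) >= t_hc]
```

The list `selected` plays the role of the retained features; in the paper's setting the statistics would first be transformed by the (estimated) precision matrix.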

First consider the case of $\theta > 1/2$. By the triangle inequality, Theorem 3.1, and
Lemma 2.3 we have

(5.74) 

sup \Wp{t) - Wo{t)\ < sup \Wp{t) - W{t)\ + sup \W{t) - Wo{t)\ 
<s--^{l)<t<s; ^.-i(i)<t<^j t>'i'-i(|) 

< Lp{p~^+p~'/',/m+p-'/') < Lp{p-'/'+p-^). 

This result is parallel to Lemma 2.3. When r < /3, similar to (5.51) we can 
obtain that for all u satisfying \u\ < 04 /Tp, 

(5.75) 

Wp{t;* +u)- Wp{t;*) < Lp{p-i +p-P) + [Lpp-^o(^'^''^) - sup VFo(t), 

for some constant cj > 0, where t** is as in (5.51). It is easy to check 
that sup|i>o}T^o(i) = Lpp'^^^'"''^ > p'^ > p~^/'^, co(/3, r, a, 6*) < /3, and 
p-co{i3,r,a) supj>o Wo{t) > p~^ . Tlius, for any u > Lpp-'=2(^'^'«) with C2(/3, r, a) < 
min{ ^-y'") , it holds that Wp{t;*+u)-Wp{t;*) = -Lpp-2^2(/3,r,a)(i^ 

0(1)) < for all \u\ < c^/Tp. Again, using similar arguments as in Theorem 
2.2, we can prove that Wp{t) - Wp{t**) < for all \t - t**\ > c^/rp. Thus, 
we have proved that 

|tf ^ -t*;\ = \thc{Fp) - < Lpp-'^^p^^^'^\ 

This together with (5.52) completes the proof of the Theorem when r < (3. 

Now we consider the case where r > /3. If t > Tp or t < ^2/5 logp— Ai with 
Ai = do log logp/Vlogp, by Lemma 2.4 and (5.74), it holds Wp{t) = Wo(i) + 
(Wp{t) - VF'o(i)) ^ -^P'^^'^ + ^pP''^ + Lpp-'^''^. Recah that /3 < 1 - < 0. 
Thus Wp{t) < -l=p-^/2(l + 0(1)). If ^/2j3Togp - Ai < t < Tp, using similar 

argument we obtain that Wp{t) = Wo{t) + {Wp{t)-Wo{t)) > ?'"^/^(l-o(l)). 
Thus, 

tf^G (V2/31ogp-Ai,rp) 
and the claim in the theorem follows. 
Next we consider the case where $\theta < 1/2$. By Theorem 3.1 and Lemma 2.4
and noting that $1 - \theta > \beta > 0$, for any $t, t + u \in [s_p(\theta), s_p^*]$ we have

Wp{t + u)- Wp{t) = {Wpit + u)- Wo{t + u)) - (Wpit) - Woit)) 

+ {Wo{t + u)- Wo{t)) < Lpp-'/^^/¥(t) + Lpp-^ + {Wo{t + n) - Wo{t)). 

Since p-^F{t) < p-^+^ and /3 > (1 - 9)/2, it follows that 

(5.76) Wp{t + u)- Wp{t) < Lpp-^^-^'^^^ + Lpp-^ + (Woit + u) - Wo{t)). 

So the stochastic behavior of Wo(t) in the range t £ [sp{0),s*] determines 

the stochastic behavior of Wp{t + u) — Wp{t). By direct calculations, we 
obtain that if (/?, r, 6) falls in either of the six sub-regions as follows 

• 1/3 < ^ < 1/2, {l-e)/2 < p <l-e,r> max{p*g{p), 

• I < < i (1 - 9)/2 < 13 < l-2e,r > max{^, p^(/5)}, |r - 
VI - 26*1 > y/1-20- p 

• l<e<ll-29<(3<l-e,r> max{^,pim 

• < < i, ( 1 - 0)/2 < /3 < 3(1 - 20)/4, r > max{|, /);(/3)}, |r - 
VI -2^1 > ^/l-29-P 

• < g < i, 3(1 - 2g)/4 < p < 1 - 26, r > max{i^, p;(/3)}, |r - 
VI - 26*1 > VI - 261 - /3 

• O<0< i, l-2^</3<l-0, r>max{i^,p;(/3)}, 

then t** E (sp(0),s*) and the maximum of Wo{t) is achieved in {sp{6),s*). 
So it reduces to the 9 > 1/2 case. Note that the six regions above can be 
summarized into Condition (a)-(b) in Theorem 1.3. By (5.76) and using 
similar proof as that for > ^ we finish the proof of Theorem 3.2. □ 

5.20. Proof of Lemma 3.4. Introduce $u_p(t) = u_p(t, \epsilon_p, \tau_p, \Omega) = \sum_{j=1}^{p} E[\tilde\mu(j)^2\cdot 1\{|\tilde Z(j)| > t\}]$. The following lemma is proved in the appendix.

Lemma 5.12. For any t > 0, there are universal constants Ci > 
and C2 > such that for sufficiently large p, Ci min{t, j^—^==\^Jn^ < 

7j{tZZSl - ^2(l + i)V^ andmp{t,ep,Tp,a) < C2{l + t)K^T^n-^^^pF{t), 
where F{t) is defined in Lemma 5.2. 

The following lemma is proved in Section 5.21. 
Lemma 5.13. There is a constant $C > 0$ such that with probability at
least $1 - o(1/p)$, for all $0 < t < \sqrt{2\log(p)}$,

(5.77) ^\Mpit,Z,fi) - mpit,ep,Tp,n)\ < CK^ilogp)^^pF{t), 

(5.78) \Vp{t,Z,fi)-Vp{t,e.p,Tp,n)\ < CK^{\ogpf''' ^Jp^{t). 

Write for short Vp{t) = Vp{t, Z, fi), Mp{t) = Mp{t, Z, n), mp{t) = mp{t, e^, Tj 
Vp(t) = Vp{t,ep,Tp,n), Sep(t) = Sep{t,ep,Tp,Q), Sep{t) = Sep{t, Z , 
F{t) = F{t,ep,Tp,il.) and Fp{t) = Fp{t, Z, We consider the two cases 

1) t > Tp + Sp or pF{t) < iC^(logp)^°, and 2) t < Tp + Sp and pF{t) > 
Kp{logpy^ , separately, where Sp is defined in Lemma 5.4. 

Consider the first case. It suffices to show (la) p^^''^''/'^ Sep{t) < Lpp-^/'^ + 

^^p-max{4/3-2r,3/3+r}/4 ^^b) p^^^-^^^ Sep{t) < LpP" "^'^^{4/3-2r,3/3+r}/4 ^ 

Lpp-^/^. Claim (la) can be proved using the same arguments as in Lemma 
2.1, so we only need to prove (lb). 

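Consider (1b). Let $\eta$ be a $p \times 1$ vector such that $\eta(j) = 1\{(\Omega\hat\mu_t^{\Omega})(j) \neq 0\}$,
$1 \le j \le p$. Also, for any $p \times 1$ vectors $x$ and $y$, let $x \circ y$ be the $p \times 1$
vector such that $(x \circ y)(j) = x(j)y(j)$, $1 \le j \le p$. By definitions, it
is seen that $M_p(t) = (\hat\mu_t^{\Omega})'\Omega\mu = (\hat\mu_t^{\Omega})'\Omega(\mu \circ \eta)$. Using the Cauchy-Schwarz
inequality, $|M_p(t)| \le ((\hat\mu_t^{\Omega})'\Omega\hat\mu_t^{\Omega})^{1/2}((\mu \circ \eta)'\Omega(\mu \circ \eta))^{1/2}$. Recalling that
$V_p(t) = (\hat\mu_t^{\Omega})'\Omega\hat\mu_t^{\Omega}$, it follows that

$|Sep(t)| = 2|M_p(t)|(V_p(t))^{-1/2} \le 2((\mu \circ \eta)'\Omega(\mu \circ \eta))^{1/2}$.

Since the largest eigenvalue of $\Omega$ is no greater than $K_p$, the last term above is
$\le 2K_p^{1/2}\|\mu \circ \eta\|$, and so $|Sep(t)| \le 2K_p^{1/2}\|\mu \circ \eta\|$. At the same time, by Lemma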
3.1, with probability at least 1 - o(l/p), pFp{t) < p\Fp{t) - F{t)\ +pF{t) < 
Lp{pF{t))^/^ + pF{t) < Lppi-™'^^{4/3-2r,3/J+r}/2 if ^ > ^-^ + Sp. Similarly, we 

can show that pFpit) < Lp iipF(t) < Kp(logpy^ . Thus, in case (lb) we have 
pF{t) < Lppi-"^^^{4/3-2r,3/3+r}/2 _^ gy definitions, this implies that /if 
has no more than Lpp^~^^^^'^^~'^^''^^^'''^^'^ + Lp non-zero coordinates. Since Q 
is i^p-sparse, r] also has no more than ^ppi-™ax{4/3-2r,3/3+r}/2 _|_ nonzero 
coordinates. Therefore, \\fior]\\ < Lpp^-'^^^^4/^-2'-'3/3+n/4 + ^^^-e/2^ ^^^^ 
(lb) follows from the assumption that Kp < Lp. 

Consider the second case. Denote $h(t) = v_p(t)/V_p(t)$. The key is to show

(5.79) $|h(t) - 1| \le L_p(pF(t))^{-1/2}$.

Towards this end, we write $|h(t) - 1| = I\cdot II\cdot h(t)\cdot(pF(t))^{-1/2}$, where
$I = |V_p(t) - v_p(t)|(pF(t))^{-1/2}$, and $II = (pF(t))/v_p(t)$. First, by Lemma
5.13, $I \le L_p$ with probability at least $1 - o(1/p)$. Second, by Lemma 5.4,
$II \le C$ with some constant $C > 0$ whose value depends on whether $r < \beta$
and $t < \tau_p + s_p$ or $r > \beta$. Last, by Lemma 5.13 and (5.78), with probability at

least 1 - o(l/p), Vp{t)/vp{t) > 1 - CK\\ogpfl'^ > 1 - where 

we note that pF{t) > K^{logp)^ and CK^{log{p)Y/^{pF{t))^/^{vp{t)y^ < 
K^{\og{p)f/'^{pF{t))~^l'^ = o(l). As a result, with probability at least 1 - 

o(l/p), h{t) = < 1. Combining these gives (5.79). 
Next, write 

(5.80) \Sep{t) - S7p{t)\ = l^pL - < III + IV, 

^Vp(t) vSW 

where/// = \Mp{t)-mp{t)\^/h({) / ^Vp{t) and IV = mp{t)\y^h{t)-l\/ ^Vp{t). 
Recall that h{t) < 1 + Lp and that CpF{t) < Vp{t). It follows from Lemma 
5.13 that with probability at least l-o(l/p), /// < \Mp{t)-mp{t)\{pFp{t)Y'^/'^ < 
LpHp At the same time, note that IV < \h{t) — l\mp{t){vp{t))~^/'^. On one 
hand, by Lemmas 5.4 and 5.12, mp{t) < Lpnp^'^Up{t) < LpKpnp^^'^pF{t). 

On the other hand, since $v_p(t) \ge CpF(t)$, by (5.79), we have $IV \le L_p n_p^{-1/2}$
with probability at least $1 - o(1/p)$. Combining these with (5.80) gives the
claim.

By going through the proof above we see that if further G A/* (a, Kp), 
then the two cases at the very beginning can be reduced to 1) pF{t) < 
K^{logp)^°, and 2) pF{t) > /^8(logp)^^ and the claim \Sep{t) - Sep{t)\ < 

— 1/2 

LpHp can be proved using same arguments. Thus, Lemma 3.4 is proved. 

□ 

5.21. Proof of Lemma 5.13. Write for short $M_p(t) = M_p(t, Z, \mu)$, $V_p(t) = V_p(t, Z, \mu)$, $m_p(t) = m_p(t, \epsilon_p, \tau_p, \Omega)$, and $v_p(t) = E[V_p(t, Z, \mu)]$. The following
lemma is proved in Section 5.22.
Lemma 5.14. For any $t \in (0, \sqrt{2\log p}]$,
p(^^\Mp{t) - mp{t)\ > CKl{\ogpfy 

< CK'Jlogpf exp ( - _ M 



A2 / A 



P{\%{t) - vM > CK^pilogpfx) < CK^ilogpfexp - i; 



AKppF{t) ^2KppF{t) 



where ^ is as in Bennett's lemma [37, Page 851]. 



Since the proofs are very similar, we only show the first one. The goal is to 
show that with probability l-o(l/p3), \Mp{t)-mp{t)\ < CK^{log{p)f/'^{pF{t))-^/'^ 

for any $0 < t < \sqrt{2\log(p)}$. Once this is shown, we lay out an evenly spaced
grid on $[0, \sqrt{2\log(p)}]$ with an inter-distance of $1/p$, and the claim follows by
a similar argument as in the proof of Lemma 3.1.

Since Lemma 5.12 ensures that mp{t) < CK^{\ogpf/^pnp^^'^F{t), by the 
monotonicity of x^{x) and Lemma 5.14, 

(5.81) 

p(^\Mp{t) - mp{t)\ > CKlilogpfX 



< Ci^,^(logp)^ exp ( - -— 



2CK^{\ogpYpF{t) Ci^2(logp)3/2pF(i)' 

We now show the desired claim for the case pF{t) > {log{p))^^'^ and the case 
pF{t) < (log(p))^/2 separately. 

Consider the first case. Let A = C Kp (log p)^/"^ \J pF{t) . Direct calculations 
showthat A/[i^2(iogp)3/2^^(^)] < C{pF{t)y^/^ and X'^ /[K^ {log p)'^pF{t)] > 
Clog{p)Kp. By (5.81) and noting that limx^o+ ip{x) = 1, 

P (^^\Mp{t) - mp{t)\ > CKlilogpf'^^pht)}^ 
< CK'pilogpfe.p ( - < ,(1//). 



Consider the second case. Let A = CKp{logp)^. It is seen that X/[K^{logpf/^pF{t)] > 
C {log{p))^/'^ / {pF {t)) . Using Lemma 5.14 where we note that ip{x) ~ i^^^^ii 
when X — oo [37, Page 852], 

P {y^p\Mp{t) - mp{t)\ > CK'pilogpy/'') < Cilogpfexp ( - < o{l/p^). 






This together with pF{t) > p{l — Kpep)^{t) > (logp) yields the desired 
claim. □ 

5.22. Proof of Lemma 5.14. Since the proofs are similar, we only show
the first one. By Lemma 1.1, we can partition $\{1, \ldots, p\}$ into $N = N_1N_2 \le CK_p^3\log^2(p)$ sets $R_1, \ldots, R_N$ such that for any fixed index $1 \le k \le N$, the
collection of bivariate random variables $\{(\tilde\mu(j), \tilde Z(j)) : j \in R_k\}$ are independent of each other. Recall that $M_p(t) = \sum_{j=1}^{p}\tilde\mu(j)\,\mathrm{sgn}(\tilde Z(j))\,1\{|\tilde Z(j)| > t\}$
and $m_p(t) = E[M_p(t)]$. The partition allows us to write $M_p(t) - m_p(t) = \sum_{k=1}^{N}(M_p^{(k)}(t) - m_p^{(k)}(t))$, where $M_p^{(k)}(t) = \sum_{j \in R_k}\tilde\mu(j)\,\mathrm{sgn}(\tilde Z(j))\,1\{|\tilde Z(j)| > t\}$ and $m_p^{(k)}(t) = E[M_p^{(k)}(t)]$, $1 \le k \le N$. It follows that for any $\lambda > 0$,

(5.82) $P(\sqrt{n_p}|M_p(t) - m_p(t)| \ge N\lambda) \le \sum_{k=1}^{N} P(\sqrt{n_p}|M_p^{(k)}(t) - m_p^{(k)}(t)| \ge \lambda)$.

Fix 1 < k < N , using Bennett's inequality [37, Page 851], 

- > A) < exp (-^^H^T^T?^)) ' 

where ip is as in [37, Page 851], and |-Ra;|<t^ is the variance of ^Jn^Mp {t). 
Using Lemma 5.12, |i?fc|cr^ < npUp{t) < Kpy^2\og{p)npmp{t) . By the 
monotonicity of the function xiIj{x) [37, Page 851], it follows that 

p(jrn\Mi''\t)-mi'\t)\ > X) < exp ( , ^'"^^ ib( ^^^)V 

^ p Wl- 2Kpy/2 log{p)npmp{t) '^^^mp{t)U 

Inserting this into (5.82), the claim follows by recalling N < CKp\og^{p). 

□ 

5.23. Proof of Lemma 3.5. Write for short Mp{t) = Mp{t, Z, /x), Mp{t) = 
Mp{t, Z,fi), Vp{t) = Vp{t,Z,fi), and Vp{t) = Vp{t,Z,fj,), m.p{t) = mp{t,ep,Tp,Q), 
and Vp{t) = Vp{t,ep,Tp,^l). We discuss the case 1) t > Tp + Sp or pF(t) < 
A'^°(logp)^° and the case 2) t < Tp + Sp and pF{t) > Kp^{logp)^^ separately. 

Consider the first case. First, in the proof of Lemma 3.4, we have shown 
that Sep{t,Z,n,n) < Lpp^-i™^^^4^-2^'3/3+'-} + Lpp-^/2. Second, by simi- 
lar argument as in the proof Lemma 3.4 part (lb), and using Lemma 3.3, we 
can prove that Sepit, Z , fi,n) < Lpp^'i + Lpp^^/^. Com- 
bining these gives the claim. 
Consider the second case. The key is that with probability at least $1 - o(1/p)$,

(5.83) 

max{V^|Mp(t) - Mp(t)|, |t> (t) - Vp{t)\} < Lp ■ \p^-''^F{t) + {pF{t)f/\ 
(5.84) 

max{|(Fp(t) - vp{t))/vp{t)l \^{Mp{t) - mp{t))/vp{t)\} = o(l). 

for all t < Tp + Sp. To see (5.83), note that \Mp{t) - Mp(t)| = |(/if - 
fify^fi\ < ll/if — /if 111 • ||r2/u||oo, where by the Kp-sparsity of U, \\Q,fi\\oo < 
KpTpUp^^'^ , and so \Mp{t) — Mp{t)\ < KpTpUp^^'^Wfif — fifWi- Similarly, 
W^ifif +_/if)||oo < IpllillAf + Af lloo < 2Kp, and so \Vp{t) - Vp{t)\ < 
— fi^)'^{fif + fif)\ < 2Kp\\jlf — /if 111. By similar argument as in the 
proof of Lemma 3.3, it is seen that with probability at least 1 — o(l/p), 
llAf ~ Af 111 ^ pGp{t), where Gp{t) is defined therein. It is shown in Lemma 
3.3 that Gp{t) < CK^{\ogpf/'^p-^/'^F{t) + CKl{\ogpYp-^/'^{F{t)Y''^ with 
probability at least 1 — o{l/p). Combining these gives (5.83). 

To see (5.84), note that by Lemma 5.13, with probability at least 1 — 
o(l/p), 

(5.85) \Vp{t) - vp{t)\ < CK^p{\og{pfl\pF{t)f'\ 

Recall that by Lemma 5.4, $v_p(t) \ge C p F(t)$ with some constant $C > 0$
whose value depends on whether $r < \beta$ or $r > \beta$. Combining this with
the fact that pF{t) > Kp'^{logpy^ for all t < Tp + Sp, it is seen that 
CK^{log{p)^/'^{pF{t)y/'^ = o{pF{t)) = o{vp{t)). Inserting this into (5.85) 

gives that $|(V_p(t) - v_p(t))/v_p(t)| = o(1)$ with probability at least $1 - o(1/p)$.
By a similar argument, $|\sqrt{n_p}(M_p(t) - m_p(t))/v_p(t)| = o(1)$ with probability at
least $1 - o(1/p)$. Combining these gives (5.84).

We now proceed to show the lemma in the second case. Let h{t) = 
Vp{t)/Vp{t). Write 

(5.86) ^\Sep{t, Z, ^i,Cl)-Sep{t, Z, ^i,n)\ < ^/l/h{t)■I+\^/l/h{t)-l\■II, 

where / = ^\Mp{t) - Mp{t)\{Vp{t)y^/^ and // = ^Mp{t){Vp{t)y^/^ . 
Recall that by Lemmas 5.4 and 5.12, ^/n^nip (t) < Kl{\ogpfl^pF{t) < 
Klilogpf/'^Vpit). Using Lemma 5.4 and (5.83)-(5.84), 

\h{t) - 1| = Pp^^J^Wpit) - VpmpF{t)r' < Lp[p~'/-' + {logpf/\pF{t)) 

and

II < {pF{t)/vp{t)){pF{t)r^^[\Mp{t) - mp{t)\ + mp{t)] < L^. 

Recall that pF{t) > Kp^[logp)^^. This together with the inequality above 
for h{t) ensures that \h{t) — 1| < o(l). Inserting these into (5.86) gives 

\Sep{t, Z, fx, Cl) - Sep{t, Z, fi,Q)\<Lp- n~^l\p-'/\pF{t)f/^ + 1], 

and the claim follows. 

Similarly to Lemma 3.4, we see that if in addition $7 G M*{a, b, Kp), then 
the term Lpp~ 4 max{4/3-2r,3/3+r} -^^ ^^iq upper bound of the claim can be 
removed using the same proof as above. This concludes the proof of the 
lemma. □ 

5.24. Proof of Theorem 2.3. Note that for any $0 < x < 1$,

(5.87) $\bar\Phi^{-1}(x) = \sqrt{2\log(1/x)} + O\big(\log\log(1/x)/\sqrt{\log(1/x)}\big)$,

where the last term is negligible compared to the first term. So $\bar\Phi^{-1}(\text{misclassification} \mid t)$
can be well approximated by $E(t) = \sqrt{-2\log P(Y L_t(X,\Omega) < 0 \mid t)}$.
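
The approximation in (5.87) can be illustrated numerically. The following is a minimal sketch using only the Python standard library; the helper names `survival_quantile` and `leading_term` are introduced here for illustration and do not appear in the paper.

```python
import math
from statistics import NormalDist

def survival_quantile(x: float) -> float:
    """Return z with P(N(0,1) > z) = x, using the symmetry of the normal law."""
    return -NormalDist().inv_cdf(x)

def leading_term(x: float) -> float:
    """Leading term sqrt(2 log(1/x)) in (5.87)."""
    return math.sqrt(2.0 * math.log(1.0 / x))

tail_probs = [1e-5, 1e-10, 1e-20, 1e-50, 1e-100]
ratios = [survival_quantile(x) / leading_term(x) for x in tail_probs]
for x, r in zip(tail_probs, ratios):
    print(f"x = {x:.0e}: quantile / sqrt(2 log(1/x)) = {r:.4f}")
```

The ratio approaches 1 slowly as $x \to 0$, which is consistent with a correction term of order $\log\log(1/x)/\sqrt{\log(1/x)}$.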

We write Sep{t) = Sep(t, ep,Tp,Q,), Sep{t) = Sep{t, Z , fj,,Q), and Tideai = 
Tideaii^pjTp,^) for short. The following lemmas are proved in Section 5.25 
and Section 5.26 respectively. 

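Lemma 5.15. Fix a constant $\kappa > 0$. As $p \to \infty$, for any sequence $t_p \in
(0, \tau_p + s_p]$ with $s_p$ defined in Lemma 5.4 such that $\mathrm{Sep}(t_p) \ge L_p p^{\kappa}$, we have
$P(Y L_t(X,\Omega) < 0 \mid t = t_p) = \bar\Phi\big((1 + o(1))\tfrac{1}{2}\mathrm{Sep}(t_p)\big)$.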

Lemma 5.16. For any sequence of closed subsets $A_p \subset [0, \tau_p + s_p]$ with $s_p$
defined in Lemma 5.4, if there exists a constant $\kappa > 0$ such that $\sup_{t \in A_p}\{\mathrm{Sep}(t)\} \ge
p^{\kappa}$ for sufficiently large $p$, then

$\sup_{t \in A_p} E(t) \le \tfrac{1}{2} \cdot \sup_{0 < t \le \tau_p + s_p} \mathrm{Sep}(t)$.

Now we proceed to prove the theorem. The key is to show

(5.88) $\min_{t > \tau_p + s_p} P(Y L_t(X,\Omega) < 0 \mid t) \ge \bar\Phi\big((1 + o(1))\tfrac{1}{2}\mathrm{Sep}(T_{ideal})\big)$,

(5.89) $\min_{0 < t \le \tau_p + s_p} P(Y L_t(X,\Omega) < 0 \mid t) = \bar\Phi\big((1 + o(1))\tfrac{1}{2}\mathrm{Sep}(T_{ideal})\big)$.

Then combining the above results completes the proof of the theorem.

We first prove (5.88). When r < /3, by proof (lb) in Lemma 3.4 we have 
Sepit, Z, ^, f)) < Lpp^-i for all t > Tp+Sp with probability 

at least 1 — o{p~^). When r > /?, by Lemma 3.4 we have 

Sep{t) = S^p{t) + (Sepit) - S7p{t)) < S^p{t) + Lpp'^'^. 

Following the same lines as those in the proof of Lemma 2.2 we can show that 
for r > /3, Sep{t) < Lpp~^ '^^ with c% = cs{(3,r) > 6{l3,r) for aU t > Tp + Sp. 
Combining these and recalling that r > Pg{(3) and (3 € {^-^, I — 0), we have 

Sep{t) < Lpp^-'=9('^''^) with cg{/3,r) some constant whose value depends on 
whether r < /3 or r > /3 and satisfies cg{f3,r) > J(/3,r), for all t > Tp + Sp, 
with probability at least l — o{p~^). Recall that Sep{Tideai) = Lpp~ '5(^,r)^ 
Thus, 

P{YLt{X,n) < 0\t) = ^(^Sepit)) > $(Lpp^-'=9('^'^))(l - o{p-')) 
»^>((l + o(l))^5^(rirfeaz)). 

This completes the proof of (5.88). 

Next we prove (5.89). Since (5.87) ensures that $\bar\Phi^{-1}(\text{misclassification} \mid t)$
can be well approximated by $E(t)$, we only need to prove

(5.90) $\sup_{0 < t \le \tau_p + s_p} E(t) \le \tfrac{1}{2}\sup_{0 < t \le \tau_p + s_p} \mathrm{Sep}(t)(1 + o(1))$.

Then, $\bar\Phi\big(\tfrac{1}{2}\sup_{0 < t \le \tau_p + s_p}\mathrm{Sep}(t)(1 + o(1))\big)$ provides a lower bound for the
misclassification rate $P(Y L_t(X,\Omega) < 0 \mid t)$ for $0 < t \le \tau_p + s_p$. Taking
$t_p = T_{ideal}$ in Lemma 5.15 and noting that $T_{ideal} \in (0, \tau_p + s_p]$ shows
$P(Y L_t(X,\Omega) < 0 \mid T_{ideal}) = \bar\Phi\big(\tfrac{1}{2}\sup_{0 < t \le \tau_p + s_p}\mathrm{Sep}(t)(1 + o(1))\big)$. Combining
these yields (5.89).

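We now proceed to prove (5.90). Define $A_p = \{t : t \in (0, \tau_p + s_p],\ \mathrm{Sep}(t) \le
\tfrac{1}{2}\sup_{0 < t \le \tau_p + s_p}\{\mathrm{Sep}(t)\}\}$. Then by Lemma 5.16 and (5.87), for large enough
$p$,

$\sup_{t \in A_p} E(t) \le (1 + o(1))\tfrac{1}{2}\sup_{0 < t \le \tau_p + s_p}\{\mathrm{Sep}(t)\}$.

So it remains to show that uniformly for all $t \in A_p^c = (0, \tau_p + s_p] \setminus A_p$,

(5.91) $E(t) \le (1 + o(1))\tfrac{1}{2}\sup_{0 < t \le \tau_p + s_p}\mathrm{Sep}(t)$.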



We proceed to prove the above claim (5.91). Introduce the event 
_ . \Mp{t)-m,{t)\ \v (t)-vM 

Bp - {sup 7-r < LpP , sup ^— < LpP j. 

teA- mp{t) t&A- pF{t) 

where k = {1 — 0)/2 — 5{f3,r) > 0. The proof has two steps: (a) show that 
P{Bp) > 1 — o(l/p), and then (b) show that on the event Bp, the desired 
claim in the lemma holds. 

We first show (a). Recall that we have proved in (2.13) that supo<j<^2iogp 

Sepit) 

Lpp'^. By Lemma 5.4 and (5.78), Vp{t) > CpF{t) with some constant C > 
0, where the value of C depends on whether r < /5 or r > /3. More- 
over, by definition of A^, Sep{t) > ^Lpp~^ for t G Ap. It follows that 

mp{t) = \\/vp{t)Sep{t) > \J CpF{t)Lpp'^ . On the other hand, by Lemma 

5.12 mp{t) < Lpp^~^I^F{t), so we can derive pF{t) > Lpp^'^^^ and conse- 
quently, ^Jri^mp{t) > Lpp^'^^^ and Vp{t) > Lpp^'^'^^ . By Lemma 5.14 and 
using similar arguments as those in Lemma 5.13, we can prove that for each 



mp{t) J p-^ \ Vp[t) J p-^ 



Using the grid point method as that for proving (3.1) shows that P{Bp) > 
l-o(l/p). 

We now show (b). On the event Bp, 

(5.92) 



Mp{t)NVp{t) < (1 + Lpp~^)mp{t)/Jvp{t) < (1 + Lpp-^)- sup Sep{t) 

^ ^ 0<t<rp+Sp 

This together with the definition of i?(i) completes the proof of claim (b). 
By (5.92), uniformly over all < t < Tp -|- Sp, 

P{YLt{X,n) < 0\t) > ^({l + Lpp~^)l sup S^p{t))p{Bp) 

^ ^ 0<t<Tp+Sp ' 

■>^[{\^Lpp--)\ sup 5^(t))(l-o(-)). 

This, together with (5.87), proves (5.91) and completes the proof of the 
theorem. □ 

5.25. Proof of Lemma 5.15. Write for short $\mathrm{Sep}(t) = \mathrm{Sep}(t, \epsilon_p, \tau_p, \Omega)$,
$M_p(t) = M_p(t, Z, \mu)$, $V_p(t) = V_p(t, Z, \mu)$, $m_p(t) = m_p(t, \epsilon_p, \tau_p, \Omega)$, and
$v_p(t) = v_p(t, \epsilon_p, \tau_p, \Omega)$. Define the event

$B_p = \{|V_p(t_p) - v_p(t_p)| \le L_p p^{-\kappa/2} p F(t_p),\ |M_p(t_p) - m_p(t_p)| \le L_p p^{-\kappa/2} m_p(t_p)\}$.
The key is to first show that (a) 

(5.93) P{B;) < exp ( - i \og{p){S^p{tp)f ■ (1 + o(l)) , 

and then show that (b) the desired claim holds on the event $B_p$. Combining (a) and (b) proves the
claim.

We first prove claim (a). Note that by Lemma 5.4, $v_p(t) \ge C p F(t)$ with
some constant $C > 0$, where the value of $C$ depends on whether $r > \beta$
or r < /3. Further by Lemma 5.12, < 

CK^{logpf/^Vp{t), andsothat ^mp{t) > Cnpml{t)/[K^{logpf/'^Vp{t)] = 

;,.(g;)3/. (^W)^- Taking A, = Kp{logp) { ^^i^ ) ''^^p^^v)^ ^hen < 
Lpmp(tp). It follows that P(y/n^|Mp(tp)-mp(tp)| > CK^{\og{p))'^-Lpmp{tp)) < 
P{y/rh^\Mp{tp) - mp{tp)\ > C K^{\og{p))'^ \p) , where by Lemma 5.14, the 
right hand side 

< CKl{\ogpfe^Y> ( - {S^p{tp))\\ogp)). 

Since $\mathrm{Sep}(t_p) \ge L_p p^{\kappa} \to \infty$, it follows easily that

(5.94) $P\big(|M_p(t_p) - m_p(t_p)| > L_p p^{-\kappa/2} m_p(t_p)\big) \le \exp\big(-(\mathrm{Sep}(t_p))^2(\log p)(1 + o(1))\big)$.



Next we consider Vp{t). Let Xp = Sep{tp)y {log p)KppF{tp). Using the 

same technique as for proving (5.94) we obtain that Xp < Lpp~^/'^pF{t). 

Further, by Lemma 5.14 we have

(5.95) $P\big(|V_p(t_p) - v_p(t_p)| > L_p p^{-\kappa/2} p F(t_p)\big) \le \exp\big(-(\mathrm{Sep}(t_p))^2(\log p)(1 + o(1))\big)$.

Combining (5.94) with (5.95) proves (5.93).

On the set Bp, since Vp{tp) > CpF{tp) by Lemma 5.4, we have 

1 + 7^ = 1+'^(1)- Therefore, 



(5.96) = i^^M (i + 0(1)) = Sep{tp) (1 + o(l)) . 

Combining (5.93) with (5.96), the misclassification rate can be bounded as

$P(Y L_t(X,\Omega) < 0 \mid t_p) \le \bar\Phi\big(\tfrac{1}{2}\mathrm{Sep}(t_p)(1 + o(1))\big) + P(B_p^c) \le \bar\Phi\big(\tfrac{1}{2}\mathrm{Sep}(t_p)(1 - o(1))\big)$,

and

$P(Y L_t(X,\Omega) < 0 \mid t_p) \ge \bar\Phi\big(\tfrac{1}{2}\mathrm{Sep}(t_p)(1 + o(1))\big) P(B_p) \ge \bar\Phi\big(\tfrac{1}{2}\mathrm{Sep}(t_p)(1 + o(1))\big)$.

Thus the claim follows easily. □



5.26. Proof of Lemma 5.16. Recall that $P(Y L_t(X,\Omega) < 0 \mid t) = \bar\Phi\big(M_p(t)/\sqrt{V_p(t)}\big)$.
By (5.87), to prove the lemma, it suffices to prove that uniformly for all
$t \in A_p$,

(5.97) $M_p(t)/\sqrt{V_p(t)} \le (1 + o(1))\tfrac{1}{2}\sup_{0 < t \le \tau_p + s_p}\mathrm{Sep}(t)$.

Write Sep{t) = Sep{t, ep,Tp, 17) for short. We consider the cases (a) pF{t) > 
Kl{\ogp)\ ^mp{t) > Kl{\ogp)\ (b) ^mp(t) < ^^(logp)^ pF{t) > 

Kp{logpy , and (c)y^mp(t) > Kj(logp)^, pF{t) < Kp{logpy separately. 
For case (a), define the event 

$B_p = \Big\{\sup_{t \in A_p}\frac{|M_p(t) - m_p(t)|}{m_p(t)} \le \frac{1}{\sqrt{\log p}},\ \sup_{t \in A_p}\frac{|V_p(t) - v_p(t)|}{pF(t)} \le \frac{1}{\sqrt{\log p}}\Big\}$.

We will first prove P{B^p) < o(l/p). Let A = Ap = Ci^-3(log p)-^'"^ ^mp{t) 
with C > some constant. Then by Lemma 5.14, using similar arguments 
as those in Lemma 5.13 we obtain that with probability at least 1 — o{p~^), 
\^p{t) ~ "T'p(OI — (logp)~"^'^^7Tip(t). Using the grid points method as that in 
Theorem 3.1, we can prove that except for a probability of o(l/p), 

sup \Mp{t) - rnp[t)\ ^ 

i6Ap,^mp(t)>(logp)i9 mp{t) 



As for $V_p(t)$, using a similar argument and Lemma 5.13, we obtain that
with probability at least $1 - o(1/p)$,

(5.98) $\sup_{t \in A_p,\ pF(t) \ge (\log p)^{19}}\frac{|V_p(t) - v_p(t)|}{pF(t)} \le (\log p)^{-1/2}$.

Thus we have proved the desired claim that $P(B_p) \ge 1 - o(1/p)$.

Next by Lemma 5.4, $pF(t)/v_p(t) \le C$ for all $0 < t \le \tau_p + s_p$. Then on the
event $B_p$,



mp{t) Vlogp Vp{t) VlogP Vp{t) 

where the o(l) is uniformly over all t. Therefore, for any t G Ap, 



Mp{t)/J%{t) = (1 + o{l))mp{t)/J^) < (1 + 0(1))^ sup Sep{t), 

and (5.97) has been proved. 

Now we consider case (b). By the proof of Lemma 5.13 we obtain that 

~ —1/2 

except for a probability of o(l/p), for any t € Ap, Mp{t) < mp{t)+Lpnp < 
LpHp^^"^. Since we assumed that pF{t) > Kp{logpy , by (5.98) and the 
same argument as that for (5.84), we have ^^|jy = 1 + o( ^^^p )- Since by 
lemma 5.4, Vp{t) > CpF{t) > C(logp)-^/2 ^^^i^ 

some constant C > whose 
value depends on whether r > /3 or r < /?. Thus, with probability at least 
1 — o{l/p), for any t G Ap, 



(5.99) Mp{t)/^V^) < LpTip^/^/^^) < LpTip^l^. 

Thus, the claim in the lemma follows automatically by the assumption that
$\sup_{t \in A_p}\{\mathrm{Sep}(t)\} \ge p^{\kappa}$ with $\kappa > 0$.

Finally we consider case (c). By Lemma 3.1, pFp{t) < Lp. Thus, using the 
same arguments as those for proving Lemma 3.4, part (lb) we obtain that 



(5.100) Mp{t)/^JVp{t) < LpUp^l^. 

Using similar arguments as in case (b), we prove that (5.97) continues to hold
in case (c). This completes the proof of the lemma. □

6. Appendix. 

6.1. Proof of Lemma 5.12. Recall that $\bar\Phi = 1 - \Phi$ is the survival function
of $N(0, 1)$. The following lemma is proved below.

Lemma 6.1. For any $t > 0$ and $u > 0$, there are universal constants
$C_1 > 0$ and $C_2 > 1$ such that $C_1\min\{t, 1/u\} \le \frac{1}{u}\cdot\frac{\bar\Phi(t-u) - \bar\Phi(t+u)}{\bar\Phi(t-u) + \bar\Phi(t+u)} \le C_2(1 + t)$.
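
We now show Lemma 5.12. Let $\tilde\mu = \Omega\mu$ for short. First, by definitions,
$\sqrt{n_p}\,m_p(t) = \sum_{j=1}^{p} E\big[\sqrt{n_p}\,\tilde\mu(j)\,\mathrm{sgn}(\tilde Z(j))\,1\{|\tilde Z(j)| > t\}\big] = \sum_{j=1}^{p} E\big[\sqrt{n_p}\,\tilde\mu(j)\big(\bar\Phi(t - \sqrt{n_p}\,\tilde\mu(j)) - \bar\Phi(t + \sqrt{n_p}\,\tilde\mu(j))\big)\big]$.
Noting that for any fixed $t > 0$, $u[\bar\Phi(t - u) - \bar\Phi(t + u)]$ is a symmetric function of $u$,

(6.1) $\sqrt{n_p}\,m_p(t) = \sum_{j=1}^{p} E\big[|\sqrt{n_p}\,\tilde\mu(j)|\big(\bar\Phi(t - |\sqrt{n_p}\,\tilde\mu(j)|) - \bar\Phi(t + |\sqrt{n_p}\,\tilde\mu(j)|)\big)\big]$.

Similarly, we have

(6.2) $n_p v_p(t) = \sum_{j=1}^{p} E\big[n_p\tilde\mu^2(j)\big(\bar\Phi(t - |\sqrt{n_p}\,\tilde\mu(j)|) + \bar\Phi(t + |\sqrt{n_p}\,\tilde\mu(j)|)\big)\big]$.

Since $\Omega$ is $K_p$-sparse and $|\sqrt{n_p}\,\mu(j)| \le \tau_p \le \sqrt{2\log(p)}$, we have $|\sqrt{n_p}\,\tilde\mu(j)| =
|\sum_{k=1}^{p}\Omega(j,k)\sqrt{n_p}\,\mu(k)| \le K_p\sqrt{2\log(p)}$. Comparing (6.1) and (6.2), the
first claim follows by Lemma 6.1. The second claim follows easily from the
first claim and the fact that $|\sqrt{n_p}\,\tilde\mu(j)| \le K_p\tau_p$. □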

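6.2. Proof of Lemma 6.1. Consider the first inequality first. Let $\phi(\cdot)$ be
the density of $N(0, 1)$. For any real number $v$, write

$\frac{\bar\Phi(t - v)}{\phi(t - v)} = \int_0^\infty \exp\big(-(x^2/2 + (t - v)x)\big)\,dx$,

where the right hand side is strictly monotone in $v$. Therefore, $\bar\Phi(t - u)/\phi(t - u) \ge
\bar\Phi(t + u)/\phi(t + u)$, or equivalently, $\bar\Phi(t + u)/\bar\Phi(t - u) \le \phi(t + u)/\phi(t - u)$.
Combining this with basic algebra,

(6.3) $\frac{1}{u}\cdot\frac{\bar\Phi(t-u) - \bar\Phi(t+u)}{\bar\Phi(t-u) + \bar\Phi(t+u)} \ge \frac{1}{u}\cdot\frac{\phi(t-u) - \phi(t+u)}{\phi(t-u) + \phi(t+u)} = t\cdot\frac{1}{ut}\cdot\frac{e^{ut} - e^{-ut}}{e^{ut} + e^{-ut}}$.

When $0 < ut \le 1$, the right hand side $\ge t\cdot\inf_{0 < x \le 1}\big\{\frac{1}{x}\cdot\frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}\big\}$. When $ut > 1$,
by the monotonicity of the function $(e^{x} - e^{-x})/(e^{x} + e^{-x})$, the right hand
side $\ge (1/u)\cdot[(e^{ut} - e^{-ut})/(e^{ut} + e^{-ut})] \ge (1/u)\cdot[(e - e^{-1})/(e + e^{-1})]$. Letting
$C_1 = \min\big\{\inf_{0 < x \le 1}\big\{\frac{1}{x}\cdot\frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}\big\},\ (e - e^{-1})/(e + e^{-1})\big\}$ gives the claim.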
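
The two facts behind the first inequality lend themselves to a quick numerical check: the density ratio $[\phi(t-u) - \phi(t+u)]/[\phi(t-u) + \phi(t+u)]$ equals $\tanh(ut)$, and the corresponding ratio of survival functions dominates it. The sketch below uses only the standard library; the function names and the test grid are illustrative choices, not part of the paper.

```python
import math

def phi(x: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def sf(x: float) -> float:
    """Standard normal survival function, 1 - Phi(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def density_ratio(t: float, u: float) -> float:
    """(phi(t-u) - phi(t+u)) / (phi(t-u) + phi(t+u))."""
    a, b = phi(t - u), phi(t + u)
    return (a - b) / (a + b)

def survival_ratio(t: float, u: float) -> float:
    """(sf(t-u) - sf(t+u)) / (sf(t-u) + sf(t+u))."""
    a, b = sf(t - u), sf(t + u)
    return (a - b) / (a + b)

points = [(t, u) for t in (0.1, 0.5, 1.0, 2.0, 4.0) for u in (0.1, 0.5, 1.0, 3.0)]
# The density ratio is tanh(u * t); the survival ratio is at least as large.
identity_ok = all(
    math.isclose(density_ratio(t, u), math.tanh(u * t), rel_tol=1e-9)
    for t, u in points
)
bound_ok = all(survival_ratio(t, u) >= density_ratio(t, u) for t, u in points)
print(identity_ok, bound_ok)
```

A finite grid is of course no substitute for the proof; it only illustrates the direction of the inequality.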



Consider the second inequality. When $u > 1$, the claim follows trivially,
so we consider the case $0 < u \le 1$ only. By Taylor expansion, there is a
constant $C_3 > 1$ such that

(6.4) $\frac{1}{u}\cdot\frac{\bar\Phi(t-u) - \bar\Phi(t+u)}{\bar\Phi(t-u) + \bar\Phi(t+u)} \le \frac{2u\max_{\{t-u \le s \le t+u\}}\{\phi(s)\}}{u\,\bar\Phi(t-u)} \le C_3\frac{\phi(t-u)}{\bar\Phi(t-u)}$,

where in the second inequality we have used $t > 0$ and $u \le 1$. At the
same time, by Mills' ratio [41], there is a constant $C_4 > 0$ such that $\phi(t) \le
C_4(1 + |t|)\bar\Phi(t)$. Therefore, $\phi(t - u)/\bar\Phi(t - u) \le C_4(1 + |t - u|) \le 2C_4(1 + t)$. Insert
this into (6.4). The claim follows by letting $C_2 = \max\{1, 2C_3C_4\}$. □

6.3. Proof of Lemma 5.3. Note that $P(|X| > t, |Y| > t) = P(X > t, Y > t) +
P(-X > t, Y > t) + P(X > t, -Y > t) + P(-X > t, -Y > t) =
I_1 + I_2 + I_3 + I_4$. Consider $I_3$. Define $\tilde Y = 2\tau - Y$. Then $(X, \tilde Y)$ has a joint
normal distribution with mean $(0, \tau)$ and correlation $-\rho$. Since $\tau > 0$, it is
seen that $I_3 = P(X > t, \tilde Y > t + 2\tau) \le P(X > t, \tilde Y > t)$. Similarly, we can
obtain that $I_4 \le P(\tilde X > t, \tilde Y > t)$ with $\tilde X = -X$ and $\tilde Y = 2\tau - Y$. So we
only need to bound $I_1$ and $I_2$.

Since the proofs are similar, we only show the case $\rho > 0$. Write $P(X >
t \mid Y > t) = P(X > t, Y > t)/P(Y > t)$. First, by elementary calculus,

"^^"^ - - - 1 ceM- '~'%i';.r' i - ^ p'- 

Second, note that when $0 < t < \tau$, $P(Y > t) \ge 1/2$, and that when $t > \tau$,
$P(Y > t) = \bar\Phi(t - \tau) \ge C[1 + (t - \tau)]^{-1}\phi(t - \tau)$ (e.g. by Mills' ratio [41]), where
we note that $[1 + (t - \tau)]^{-1} \ge (1 + t)^{-1}$. Combining these with elementary
algebra,

Cexp(-tV2), < t < r, 

P{X >t\Y >t) < { C{l + t)exp{- ''-%-^^' ), T<t<j^ 



C(l + t)exp(- ((\-^y ), t>^r. 



Since $0 < \rho < a$, the claim follows by basic algebra. □

6.4. Proof of Lemma 5.9. Write $h(t) = \bar\Phi(t)/\phi(t)$ for short. For positive
functions $f(t)$ and $g(t)$ defined over $(0, \infty)$, we say that $f(t) \asymp g(t)$ if there
are constants $C_2 > C_1 > 0$ such that $C_1 \le f(t)/g(t) \le C_2$ for all $t > 0$. The
following claims can be proved by elementary calculus and Mills' ratio [41], so
we omit the proof: (a) $h(t) \asymp C\min\{1, 1/t\}$, (b) $h'(t)/h(t) = t - 1/h(t)$ and
$(t^{-1} - t^{-3}) \le h(t) \le (t^{-1} - t^{-3} + 6t^{-5})$, and (c) $h'(-t)/h(-t) \le -C\max\{1, t\}$
for all $t > 0$.
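
Claim (b) is easy to probe numerically. The sketch below evaluates Mills' ratio $h(t) = \bar\Phi(t)/\phi(t)$ with the standard library, checks the two-sided bound on a few points, and compares a central finite difference of $h$ with $t\,h(t) - 1$, which is the identity $h'(t)/h(t) = t - 1/h(t)$ rewritten. The grid, the step `eps` and the tolerance are illustrative choices.

```python
import math

def mills(t: float) -> float:
    """Mills' ratio h(t) = (1 - Phi(t)) / phi(t) for the standard normal."""
    sf = 0.5 * math.erfc(t / math.sqrt(2.0))
    pdf = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
    return sf / pdf

def lower(t: float) -> float:
    """Lower bound t^-1 - t^-3 from claim (b)."""
    return 1.0 / t - 1.0 / t**3

def upper(t: float) -> float:
    """Upper bound t^-1 - t^-3 + 6 t^-5 from claim (b)."""
    return 1.0 / t - 1.0 / t**3 + 6.0 / t**5

ts = [0.5, 1.0, 2.0, 3.0, 5.0, 10.0, 20.0]
bounds_ok = all(lower(t) <= mills(t) <= upper(t) for t in ts)

# Central finite difference for h'(t), compared with t * h(t) - 1.
eps = 1e-5
deriv_ok = all(
    math.isclose((mills(t + eps) - mills(t - eps)) / (2 * eps),
                 t * mills(t) - 1.0, rel_tol=1e-4)
    for t in (0.5, 1.0, 2.0, 3.0)
)
print(bounds_ok, deriv_ok)
```

For small $t$ the lower bound is negative and the upper bound is very loose, so the bounds carry information mainly for moderate and large $t$.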

To show the lemma, it suffices to show that $m_2'(t) < 0$ for all $t > 0$. Write

$m_2(t) = \frac{1}{h(t)}\cdot\frac{\bar\Phi(t - \tau_p) + \bar\Phi(t + \tau_p)}{\phi(t - \tau_p) + \phi(t + \tau_p)} = \frac{1}{h(t)}\cdot\frac{h(t - \tau_p)\phi(t - \tau_p) + h(t + \tau_p)\phi(t + \tau_p)}{\phi(t - \tau_p) + \phi(t + \tau_p)}$.

We show this for the case of $t > \tau_p$ and the case of $t < \tau_p$ separately.

Consider the first case. By direct calculations, it is seen that

(6.5) $m_2(t) = \frac{1}{1 + e^{-2t\tau_p}}[h(t - \tau_p)/h(t)] + \frac{1}{1 + e^{2t\tau_p}}[h(t + \tau_p)/h(t)] = m_{2a}(t) + m_{2b}(t)$.

Write for short ^(t) = h'{t - Tp)/h{t - Tp) - h'{t)/h(t). By (a)-(b) and direct 
calculations, 

\m'2,it)\ < CTpe-'-", m'^ait) = miKt - rp)/h{t)] + 0(rpie-^-*), 

where we note h{t — Tp)/h[t) > C. Note that the claim follows trivially 
if t < Tp + 3. Therefore, to show the claim, it is sufficient to show ^(t) < 
— Ct~^ min{l, (Tp/t)'^} for all t > rp + 3. Toward this end, note that by basic 
algebra and (b), 

m - -^p—f^^f—^+^tj ^ """^"(1- (t-rp)-2 + 6(t-Tp)-4)+T3F2- 
By basic algebra, we have that for sufficiently large Tp and t > Tp + 3, 

The claim now follows from elementary calculus. 
Consider the second case. Rewrite 

and so 

m2(t) = m'2c{t)h{t - Tp) + m2c{t)h'{t - Tp) + m2^(t). 
Similarly, by (a)-(c), 

\m2dit)\ < Ct~^, m'^cit) < C, m2c{t)h\t-Tp) < -Cmax{l, t}-max{l, (Tp-t)}/i(t- 

Combining these gives 

m2{t) < C[- max{l, t} • max{l, (rp - t)} + C]h{t - Tp) + C. 

Since h{t — Tp) > C, it is seen that m2{t) < for sufficiently large Tp and 
the claim follows. 

The second claim $m_2(t) \ge 1$ follows directly from the first claim and
$\lim_{t \to \infty} m_2(t) = 1$, which can be obtained immediately by (6.5). □